We Need DevOps for ML Data

softwaredoug · on April 28, 2020

In my experience, the problem has a lot to do with how teams organize around ML.

When you have engineering team separate than a data science team, you'll inevitably have unproductive conflict & politics. One team might be incentivized for stability and speed (engineering or ops) and the other model accuracy (data science). The end result can be disastrous... An engineering team that wants to bend nothing to help data scientists get their work in production. Or a data science team that only cares about maximizing accuracy, even if it might destroy prod, or be impractical to implement in a performant way.

To hit the sweet spot on accuracy, speed, and stability, you need to have one team that focuses on the end feature. It needs to be cross-functional and accountable for doing a great job at that feature. And the data scientists need to be possibly more focused on measuring and analyzing the feature's success, rather than just building models for models sake.

I'd recommend the book Agile IT Organization Design if you're interested in good team design patterns

tixocloud · on April 28, 2020

This. In my experiences across larger enterprises, data science teams rarely hold the key to production environments and therefore, relies heavily on IT to productionizing ML. And I completely agree that data scientists need to be focused on measuring and analyzing success as opposed to churning more and more models.

moandcompany · on April 28, 2020

In many cases, this is not just because production teams don't want data science teams to be able to deploy to production --due to lack of trust or confidence- but data science teams often don't want this responsibility.

softwaredoug · on April 28, 2020

There’s also perhaps a syndrome of not wanting to do the organizational work to do ML well. Instead of changing the whole org by integrating ML into every team, a data science team is hired doing god knows what. There’s status assigned to being the “data scientist” and they work away, siloed, on fun sounding deep learning models. In this mode, if they produce anything it’s impractical, divorced from the product realities, and rather hard for the main engineering/product org to maintain or implement.

The reality is there’s more work to embracing ML than hiring data scientists. Everyone needs to understand ML a little, and it needs to be OK to critically question data science work from product and engineering angles.

moandcompany · on April 28, 2020

Another aspect of this I've observed -- Personal sense of value (and industry pay goes into this) contributes to partitioning of work. If we're charitable, it's from a belief of comparative advantage, and if we are brutally honest about some people, it's because people often feel that "_____ isn't a good use of their time." This is also fed by the "sexiest job of the 21-st century" saying that's been created.

We see this in data science and machine learning where people complain about spending their time cleaning data, etc... when their time should be spent "generating insights/etc." We also see that those insights are interesting but not very useful if they aren't actionable, too costly or too impractical to implement.

Ultimate value is related to being able to contribute to and achieve the holistic outcome, but the lens of success is often focused on models or insights instead. This is a cultural and organizational problem, rather than a technological one. It also takes a dose of humility to appreciate the true value of the so-called dirty work.

softwaredoug · on April 29, 2020

I see this with my own work. I maintain the Elasticsearch Learning to Rank plugin. People assume it's all magic machine learning. The reality is much of the work involves understanding Elasticsearch plugins, informed by machine learning that needs to happen. Oh and 50% of the work is support and fun things like Maven repos :)

tixocloud · on April 29, 2020

Another point is that while we technologists love to marvel at data science and machine learning, it still begs the question of what value does it bring to the business. Does the added responsibility of creating all the infrastructure and processes worth it to justify a 5% increase in conversion rates? As you say, even the dirty work has a cost and whether that cost is worth paying to find out that there's nothing you can do to improve the business. That's why all the massive multi-year central data warehouse cleansing type projects keep failing without yielding much value. There's just a lack of focus on delivering incremental value with these data projects.

atupis · on April 29, 2020

I think currently it creates new possibilities to do business. Computer vision is that point that nowadays it is more engineering than data science so adding something like somewhat good object detection is not that hard. NLP is probably same point where cv was 5 years ago so we start seeing very good NLP models.

tixocloud · on April 29, 2020

Definitely. Even as an engineer working on CV 10 years ago, the hard part wasn’t object detection but rather network bandwidth to stream incredible amounts of data and processing it in real-time.

Spoke to an experienced engineer who used to lead NLP at MSFT and same comment. NLP models are already fantastic and it isn’t very hard to build a smart chatbot. The implementations these days are just very poor because they are not well thought out from a user perspective.

tixocloud · on April 28, 2020

Huge insight and definitely on point. Where I've worked, data science teams are focused on business impact. More responsibility requires larger budgets and at times, creates a burden. Plus I have a sense not a lot of senior executives know how to hire ML engineers in the first place as they come from a business background and would rather leave it to IT.

softwaredoug · on April 28, 2020

Exactly! Frankly I see so many naive assumptions on quantitatively measuring user behavior (like CTR means success!). I wish more time was spent robustly understanding the users behavior rather than just jumping to optimizing one unquestioned metric with a model.

Optimizing a loss function is far far easier than finding the right loss function(s)

winrid · on April 29, 2020

I agree! We had this org structure and had tons of problems. But certain small teams that worked cross functionality were very productive

c3534l · on April 29, 2020

> When you have engineering team separate than

Which is the whole idea behind DevOps: to break down the barriers between development and deployment by focusing on rapid iteration to production by continuously integrating changes into that pipeline.

It's ironic that DevOps has become a specialty in and of itself. The idea is to get rid of separate teams, not create a new one!

softwaredoug · on April 29, 2020

Oh I agree! And it's funny when I run into DevOps teams with that are rebranded Ops teams :)

simonw · on April 28, 2020

I see this as more of an organizational challenge than a technology challenge.

Getting ML models into production isn't particularly hard... if you put an engineering team on it that know how to write automated release procedures, design architecture that can scale and build robust APIs to surface the data.

But in many companies the engineers with those operations-level skills and the researchers who work on machine learning live completely separate lives. And then the researchers are expected to deploy and scale their models to production!

That's not to say that this organizational problem cannot be solved with technology/entrepreneurship. If a company can afford it it's likely much cheaper to pay an external company to solve your "ML in production" problems than to re-design your organization such that you equip your internal ML teams with the skills they need to go to prod.

calebkaiser · on April 28, 2020

I agree that a lot of the challenges around production ML are organizational, but I think in many companies, it has more to do with a lack of engineering resources than it does the separation of eng and data science (though that certainly happens).

Building and maintaining ML infrastructure from scratch is a big project. That's why you see FAANG companies hiring for ML infrastructure/platform engineers. Most startups don't have the extra cycles for that big of an undertaking, and so you see a lot of slapped-together, hacky solutions to putting models into production.

I'm biased in that I work on Cortex ( https://github.com/cortexlabs/cortex ), but I think that open source, modular tooling that removes the need to reinvent the wheel is going to have a big impact in terms of making production ML more accessible.

Cacti · on April 28, 2020

I disagree. It’s not about getting the data where it needs to be. It’s about data version control at a very fine level with very large datasets (in a way that is efficient). It’s about detecting changes in model results base don changes in data. It’s about tracking provenance of data in the datasets. It’s about potentially controlled access to the data (eg allowing models to use health care data without actually knowing the underlying data). It’s about detecting bias in datasets over time.

It’s actually quite complex, which is why generally speaking very few people do anything like this. I am unaware of any general solution to this problem, either in industry or academia.

tixocloud · on April 28, 2020

You've brought up a lot of very interesting points that we're actually looking to solve regarding data distribution changes, data version control and reproducibility, privacy guards and bias detections with our startup Orchestra (https://orchestrahq.com).

Would love to chat if you have further thoughts around the subject - there's a ton of problems we're looking to tackle in the space and would be good to get input.

kevinstumpf · on April 29, 2020

(Tecton CTO here) You’re absolutely right that ML projects can’t be solved with technology alone. Besides the right tooling, they also require process, organizational setup, buy-in from multiple stakeholders, etc. By itself, no technology will turn a company into an “ML-first” company. Both technology and organizational problems need to be solved.

A while back, we published a blog post that discusses how we approached these organizational challenges at Uber: https://eng.uber.com/scaling-michelangelo/. With Michelangelo, we found that the right tooling can both solve technical challenges and help with some organizational challenges. For example: If a standardized and centralized platform is the path of least resistance to get ML into production and solve your business problem, you get the organizational benefits of that centralization (governance/visibility/collaboration) along the way.

mmq · on April 28, 2020

> if you put an engineering team on it that know how to write automated release procedures

I think surfacing the data is just the first step, often times data scientists need to run some data exploration, the process is generally iterative, and so they need to run several experiments, resume or restart some experiments, scale training with distributed learning using several machines, or run hyper-parameters tuning, which means handling failures, visualize and debug results, before deciding if they should deploy a model. Once a model is deployed the story does not end there, because models become stale and need to be retrained. There are other issues related to compliance that need to be handled as well, and many other problems related to governance, a/b testing, ...

The good news is that there are several open source initiatives to solve several of these problems, at Polyaxon [0] we are trying to solve some of the aspects related to the experimentation phase.

[0] https://github.com/polyaxon/polyaxon

moandcompany · on April 28, 2020

Fig 4 looks like it's derived from Hidden Technical Debt in Machine Learning (2015).

https://papers.nips.cc/paper/5656-hidden-technical-debt-in-m...

As someone else says in this comment thread, this is very much an organizational problem, and cannot be viewed as just a technology problem.

The common behavior of individuals and teams is the pursuit of solutions that solve problems for them. Problems here with ML, and as we've seen with "Data Science," along with other magic technologies is that having an appreciation for the domain or context goes a long way. Being familiar with entire process, or "pipeline," is valuable, and role/functional silos often lead the problems people experience.

For some classes of machine learning problems and associated data, sourcing solutions from vendors can work, but as with any tools you can procure, you need the right people to use them appropriately. This also applies to "DevOps" which is used for comparison in the blog post.

--> DevOps example -- the philosophy seems to be about having software developers also share build/release and infrastructure responsibility. But some organizations have made "DevOps" teams to silo build/release and infrastructure work... they ended up renaming what used to be called their Build/Release or SysAdmin teams. Siloing things to be "someone else's" problem doesn't result in the major transformations that are needed.

Now imagine what happens if we substitute MLDevOps for DevOps above.

I'll continue to say "The Role of a Data Engineer on a Team is Complementary and Defined By The Tasks That Others Don’t (Want To) Do (Well)"

amznthrowaway5 · on April 29, 2020

> The Role of a Data Engineer on a Team is Complementary and Defined By The Tasks That Others Don’t (Want To) Do (Well)

Those types of tasks are also often not recognized or rewarded by management, despite being a hugely critical part of the system. I believe the incorrect hiring of scientists who are often strong in terms of core theory or number of papers published but have no clue about building real production ML systems is a huge organizational problem, often causing ML teams to fail to deliver any real value.

gas9S9zw3P9c · on April 28, 2020

Wow, I probably have seen 10 of these kind of companies over the past few months. Personally I believe (and hope) the winners in this space are going to be modular open-source companies/products as opposed to the "all-in-one enterprise solutions"

_mdb · on April 28, 2020

CEO of Tecton here, and happy to give more context. Tecton is specifically focused on solving a few key data problems to make it easier to deploy and manage ML in production. e.g.:

- How can I deliver these features to my model in production?

- How do I make sure the data I'm serving to my model is similar to what is trained on?

- How can I construct my training data with point in time accuracy for every example?

- How can I reuse features that another DS on my team built?

We've found that there's a ton of complexity getting data right for real-time production use cases. These problems can be solved, but require a lot of care and are hard to get right. We're building production-ready feature infrastructure and managed workflows that "just work" for teams that can’t or don’t want to dedicate large engineering teams to these problems.

At the core of Tecton is a managed feature store, feature pipeline automation, and a feature server. We’re building the platform to integrate with existing tools in the ML ecosystem.

We’re going to share more about the platform in the next few months. Happy to answer any questions. I’d also love to hear what challenges folks on this thread have encountered when putting ML into production.

bogomipz · on April 28, 2020

All of the open positions listed on your careers page appear to be broken. There is no field to upload or attach a CV when applying to any of the roles. Also why would a LinkedIn Profile be mandatory in order to apply for a role? There are many qualified people who have simply chosen not to be a part of that social network.

_mdb · on April 28, 2020

Ah. We're on it. LinkedIn shouldn't be required. Thanks for flagging.

jdoliner · on April 28, 2020

Pachyderm is probably one of the companies you've seen in this space. Full disclosure: I'm the founder, but I feel that we've stayed pretty true to the idea of being a modular open-source tool. We have customers who just use our filesystem, and customers who just use our pipeline system, and of course many more who use both. We've also integrated best in class open-source projects, for example Kubeflow's TFJob is now the standard way of doing Tensorflow training on Pachyderm, and we're working on integrating Seldon as the serving component. We find this architecture a lot more appealing than an all-in-one web interface that you load your data into.

gas9S9zw3P9c · on April 28, 2020

I haven't used you yet, but IMO this is the way it should be done. Once I get around to cleaning up my current custom k8s pipelines I will give you a spin :)

minimaxir · on April 28, 2020

Additionally, all of Google, Amazon, and Microsoft are pushing very heavily in the ML DevOps space. And if you are training/deploying ML models at such a frequency that you need to utilize DevOps, chances are you are already using their platforms for server compute.

0xbadcafebee · on April 28, 2020

Open Source companies are like open source car manufacturers. When the company dies and stops making the car, will the customers start a new car manufacturing business just to support their cars? Or buy a new car?

As AWS shows, proprietary all-in-one [platform] is fine as long as it's a-la-carte.

yanovskishai · on April 28, 2020

Could you please mention what are the other solutions you've got to see in this space?

mmq · on April 28, 2020

Polyaxon is an open source machine learning automation platform. It allows to schedule notebooks, tensorboards, and container workloads for training ML and DL. It also has native integration with Kubeflow's operators for distributed training.

https://github.com/polyaxon/polyaxon

verdverm · on April 28, 2020

https://dolthub.com is the cool kid right now. There is pacaderm, git lfs, IPFS.

Really what we need is version control for data, it's not just an ML data problem. It's a little different though, because you would like to move computation to data, rather than the other way around

wenc · on April 28, 2020

The utility of version controling production-sized (not sample training data) data (as opposed to code) is something I've having trouble grasping unless I'm missing something here -- and I may be, so please enlighten me.

It seems to me to be able to time-travel in data you almost need to store the Write-Ahead Log of database transactions and be able to replay that. Debezium captures the CDC information, but it's a infrastructure level tool rather than a version control tool.

In data science, most time-travel issues are worked around using bitemporal data modeling: which is a fancy way of saying "add a separate timestamp column to the table to record when the data was written". Then you can roll things back to any ETL point in a performant fashion. This is particularly useful for debugging recursive algorithms that get retrained every day.

But these are infrastructure level approaches. I'm not sure that it's a problem for a version control tool.

timsehn · on April 28, 2020

Tim , CEO of Liquidata, the company that built Dolt and DoltHub here. This is how we store the version controlled rows so that we get structural sharing across versions (ie. 50M + one row chgange becomes 50M+1 entries in the database not 100M with no need to replay logs):

https://www.dolthub.com/blog/2020-04-01-how-dolt-stores-tabl...

wenc · on April 28, 2020

Thanks, that looks like an interesting approach. I may have missed this in the article, but let's say I have a SQL database with 600m records, and an ETL process does massive upserts (20m records) every day, with many UPDATEs on 1-2 fields.

Wouldn't discovering what those changes are still entail heavy database queries? Unless Dolt has a hook into most SQL databases' internal data structures? Or WALs?

timsehn · on April 28, 2020

You have to move your data to Dolt. Dolt is a database. It's got its own storage layer, query engine, and query parser. Diff queries are fast because of the way the storage layer works.

Right now, Dolt can't be distributed (ie. data must fit on one hard drive) easily so it's not meant for big data, more data that humans interact with, like mapping tables or daily summary tables. But, long term if we can get some traction, we plan on building "big dolt" which would be a distributed version that can scale to as big as you want.

wenc · on April 28, 2020

Ah now I understand!

So for most analytic workloads, typically a columnstore db is used due to the need for performance and advanced SQL features (windowing functions) for complex analytic queries -- which I don't expect Dolt to replace. Which means if we wanted to use Dolt's features, we would have to continuously ETL the data into Dolt, which would entail mirroring the entire database (or at least the parts we want to version control).

Dolt essentially becomes a derived database specifically used for versioning. I see how this might work for some use cases.

seddonm1 · on April 28, 2020

If you are working within the Apache Spark ecosystem you can us DeltaLake https://delta.io/ to create 'merge' datasets which are transactional, versioned and allow time travel by both version number and timestamp.

jamesblonde · on April 29, 2020

Another alternative to Deltalake is Apache Hudi, which also includes bloom filters for indexing time-travel queries (efficiently exclude any files given the supplied time constraint). Z-ordered indexing in Deltalake is not available yet in open-source deltalake, only in Databricks version.

zachmu · on April 28, 2020

One of the cool things about Dolt is that you can query the diff between two commits. This functionality is available through special system tables. You specify two commits in the WHERE clause, and the query only returns the rows that changed between the commits. The syntax looks like:

`SELECT * FROM dolt_diff_$table where from_commit = '230sadfo98' and to_commit = 'sadf9807sdf'`

jacques_chester · on April 28, 2020

> In data science, most time-travel issues are worked around using bitemporal data modeling: which is a fancy way of saying "add a separate timestamp column to the table to record when the data was written".

Not quite, this is "transaction time". You also need "valid time" to be truly bitemporal. Recovering the database as of some point in time is not enough to answer questions like "when will this fact become false?" or "when did our belief about when it would become false change?", because you didn't preserve assertions about the time range over which the fact was held to be true.

In terms of implementations, ranges are better than double timestamps. They provide their own assertion of monotonicity and can be easily used in exclusion indices.

I found that Snodgrass's textbook was a good introduction to the concepts and it's available for free: https://www2.cs.arizona.edu/~rts/tdbbook.pdf

wenc · on April 28, 2020

Yes, you're correct -- an omission on my part. You need "valid time" (otherwise it's just "uni"-temporal modeling).

Thank you for the link to Snodgrass' book. I've not seen a formal book on temporal modeling in SQL before, so this is fascinating.

jacques_chester · on April 28, 2020

Glad I could help! The research seems to have puttered on for a while after this book was written, but appears to fizzle out by around the turn of the millennium.

Some notion of bitemporalism showed up in SQL 2011, but somewhat constrained compared to what Snodgrass describes.

sgt101 · on April 28, 2020

I worry about retraining every day. Isn't that a flag that says "It hasn't learned a thing and actually I'm just improving my backfitting score"?

wenc · on April 28, 2020

Not really -- in many forecasting applications in fast-changing markets, it is fairly common to dynamically retrain your recursive model to a moving window of historical data in order to adapt to your current environment (with some regularization). The length of the window depends on how fast the market changes.

For these types of recursive model applications, you cannot just fit the model once and forget about it.

slt2021 · on April 28, 2020

as long as it works well on out of sample data at deployment time, it is okay.

Until some major data drift happens, but you would notoce it anyways

sgt101 · on April 29, 2020

Honestly, I've heard people in Vegas tell me the same about their strategies vs. slots. Genuinely, if you have made money from this - well done, take it out now, congratulate yourself. If you haven't...

yanovskishai · on April 28, 2020

Thanks ! There are indeed players many new in the data versioning space (DVC and Quilt also probably worth mentioning).

I totally agree that data management problems are not just ML related. But I personally think that there are additional challenges in the space that are not just version control for data.. all the area of data quality management and monitoring for example. I liked the analogy to devops, source version was super critical problem to solve in software development, but it didn't stop there, with things like CI/CD etc. I believe we'll see similar evolution in the data space..

SirOibaf · on April 28, 2020

https://logicalclocks.com with their ML + Feature Store open source platform Hopsworks and their managed cloud version https://hopsworks.ai

jamesblonde · on April 28, 2020

Disclaimer: i am a co-founder of Logical Clocks. There are loads of interesting technical challenges in this "Feature Store" space. Here are just a few we address in Hopsworks:

1. To replicate models (needed for regulatory reasons), you need to commit both data and code. If you have only a few models, fine just archive the training data. But, if you have lots of models (dev+prod) and lots of data - you can't use git-based approaches where you commit metadata and make immutable copies of data. It scales (your data!) badly. We are following the ACID datalake approach (Apache Hudi), where you store diffs of your data and can issue queries like "Give me training data for these features as it was on this date".

2. You want one feature pipeline to compute features (not one for training and a different one when serving features). Your feature store should scale to store TBs/PBs of cached features to generate train/test data, but should also return feature vectors in single ms latency for online apps to make predictions. What DB has those characteristics? We say none, and we adopt a dual-DB approach with one DB for low-latency and one for scale-out SQL. We use open-source NDB and Hive on our HopsFS filesystem - where all 2 DBs and the filesystem share the same unified, scale-out metadata layer (a "rm -rf feature_group" on the filesystem also automatically cleans up Hive and feature metadata)

3. You want to be able to catalog/search for features using free-text search and have good exploratory data analysis. The systems challenge here is how to allow search on your production DB with your features. Our solution is that we provide a CDC API to our Feature Store, and automatically sync extended metadata to Elastic with an eventually consistent replication protocol. So when you 'rm -rf ..' on your filesystem, even the extended metadata in Elastic is automatically cleaned up.

4. You need to support reuse of features in different training datasets. Otherwise, what's the point? We do that using Spark as a compute engine to join features from tables containing normalized features.

References:

* https://www.logicalclocks.com/blog/mlops-with-a-feature-stor... * https://ieeexplore.ieee.org/document/8752956 (CDC HopsFS to Elastic) * http://kth.diva-portal.org/smash/get/diva2:1149002/FULLTEXT0... (Hive on HopsFS)

timsehn · on April 28, 2020

Here's a list of companies/tools in the Git for Data space:

https://www.dolthub.com/blog/2020-03-06-so-you-want-git-for-...

chaoyu · on April 28, 2020

I'm actually building a "modular open-source company/product" in the MLOps space:

BentoML https://docs.bentoml.org/en/latest/

simonw · on April 28, 2020

https://angel.co/companies?keywords=machine+learning+models+... lists a whole bunch of them.

dttos · on April 28, 2020

Composable https://composable.ai is another tool in this space

tixocloud · on April 28, 2020

Completely spot-on. Too many "all-in-one" platforms are just too rigid and with AI infrastructure tooling still in the early stages, the companies that adopt modular products will be able to capitalize on new advances.

alfalfasprout · on April 29, 2020

Yeah, we're releasing our platform as open source soon too... kinda feel bad for these guys but it'll be tough to compete with platforms that have a larger open source following and plenty of end-users.

fizixer · on April 28, 2020

I wonder what's the business model for teams/startups offering open-source solutions that they developed in-house.

remmargorp64 · on April 28, 2020

I was the main data science engineer at one of my previous companies. We used tools like airflow for running python scripts to import data, clean/transform it, train models, and even test various models against datasets. We also used Azure for similar things.

It's easy to do "dev ops" for machine learning. Basically, just automate everything and implement gatekeeping mechanisms along with active monitoring.

It's true, though. I had to cobble together a lot of custom things at the time, but it wasn't that hard to do.

nik_s · on April 28, 2020

I'm the CTO at a data science company, and this has been my experience too. I've been lucky enough to have quite a few engineers go from zero practical experience to being able to train and deploy complex ml solutions, and the most successful solutions have always involved a combination of just a couple of tools: - airflow and/or celery for running data extraction and transformation jobs - pandas and numpy for data wrangling - sklearn, xgboost, lightgbm, pytorch or tensorflow for training/inference - flask or Django to serve results

It's a handful of technologies, but they're (generally) mature, battle tested, and well documented.

softwaredoug · on April 29, 2020

Generally true. Though I will say that in larger orgs, you will occasionally get someone doing some ML they read a paper on that's not well supported by major tooling. I mean it's the same trend chasing you see in engineering...

starpilot · on April 29, 2020

Good god it's hard to do this at a non-tech company. MLOps would be great, but we don't really have "Ops," just IT, since our main business is not software. And we don't have Dev either, so we don't have anyone to really emulate on the inside. Our data scientists are foremost analysts who can write some Python, they don't know OO or memory optimizations or anything. They've never used a bash prompt or know what one is. Management thought we could orchestrate this huge waterfall schedule for a project and now it's falling apart as we open each new box of surprises...

proverbialbunny · on April 29, 2020

If you don't have a dev, how are you collecting any data to begin with?

kostas_f · on April 29, 2020

I 'll disagree with most comments that it's mainly an organizational problem. Creating tooling for things like:

- managing different data sources

- versioning data

- monitoring how new data affects the model

- testing that certain SLAs are met before new features are deployed

- ability to rollback

- data & model quality monitoring

is technically challenging.

Obviously there are engineers that will quickly hack something together and will falsely think that they have a good enough MLOps solution. I have been part of such teams.

Most companies are not Google, Facebook, or Uber. The large ones very often don't have the know-how to create a robust technical solution around this process, and even if they do it can take them years and the smaller ones lack both the resources and technical expertise.

I'm always looking for new ideas that can become successful business and when I saw the Uber Michelangelo here on HN a few years ago, I was thinking that selling similar tooling to other companies, had great potential. Seems that the right team to create that company was the one the built Michelangelo itself :)

smeeth · on April 28, 2020

I really find it difficult to put into words just how little I care to pay for a web ui so I can "manage" my data.

Data pipelines are a real problem though, and I'm very interested in what startups do with this space.

seddonm1 · on April 28, 2020

We have been thinking about these problems for a few years now and have built Arc https://arc.tripl.ai (fully open source.) which is an abstraction layer on top of Apache Spark to help end-users rapidly build and deploy data pipelines without having to know about #dataops. Ultimately we decided that giving users a decent interface https://github.com/tripl-ai/arc-starter (based on Jupyter Notebooks) and encouraging a 'SQL first' approach means we can give users flexibility but also have a standardised way of deploying jobs with many of the devops attributes (like logging and reliability). You can run Arc as a standard docker run command or using Argo Workflows https://argoproj.github.io/ on Kubernetes as the orchestrator as it plays nicely with Arc and is easy to build resilient pipelines (retries etc.)

factorialboy · on April 28, 2020

> Data pipelines are a real problem though

Can you please elaborate more, thanks.

prions · on April 28, 2020

Its not trivial to create and manage Data Pipelines if you care about scale, serving a wide range of inputs and outputs, or making this data easy to surface and spread throughout your org (i.e. making it actually useful to regular people).

"Static ETL" like running the same database load every day at 1:00am isn't a super challenging problem. Doing it across many tables with complex transformations and multiple steps easily can be. You really have to consider reliability, processing speed, failure methods and other problems that dont really arise until you hit a certain scale.

There's also the issue of what people want out of a Pipeline that's changing. If you want people to be ""data driven"", then that means they need easy access to potentially all of your company's data on an ad hoc basis. So now your boring ETL 1 am pipeline isnt really serving any of these new usecases.

How do you create flexible pipelines that can be created from any dataset on an ad hoc basis? This is where tools like Airflow or Prefect come in. Creating a platform that can create these types of Pipelines is a real problem.

And before you even ask yourself _how_ to process this data, you need to also ask _where_? If you want to do what I outlined above - making your data more accessible and easy to use - then you probably need to rework how you're storing your data. But Data Lakes (and others) are a whole topic in and of itself.

dttos · on April 28, 2020

Suggest you check out https://composable.ai for building out robust data pipelines

ska · on April 28, 2020

"We need fewer data scientists and more data janitors" - anon

dnautics · on April 29, 2020

Honest question (though I suppose the clickthroughs to the comments are likely to be a biased sample): is "getting to prod" really the gatekeeper/bottleneck for most ML? I would have thought "a model that works" is much harder, especially given how hyped the field is and how many people are trying to tackle problems that are I'll suited to the current batch of ML techniques.

Unless the issue here is data collection in prod to start training your model.

proverbialbunny · on April 29, 2020

>is "getting to prod" really the gatekeeper/bottleneck for most ML?

The most common bottleneck is collecting the right data. It can take years, or even a task force just to get the right data before the data scientist can begin.

>I would have thought "a model that works" is much harder

It depends how experienced the data scientist is. Early on into a project a data scientist can do a feasibility assessment. They should identify what is possible, and how possible. Sometimes some data science projects are heavy on the research side where where it can take 2 weeks to 3 months to figure out if something is possible. Sometimes the feasibility assessment ends up being incorrect and a goal is shown to be impossible.

Once research is done it usually takes 4 weeks to 6 months for a data scientist to build a model. The upper bound is rare and happens because of recursive refinement to increase accuracy, trying to get every last drop out of what is possible.

In contrast it can take months to years for the company to begin to collect the right data for a data scientist to be able to begin to do what benefits the company. Sometimes crowd source projects need to be created just to collect the required data. It then takes an average of 3 to 6 months for productionization if there is clever feature engineering in the model. Note: When I say productionization, I mean all the way to the end customer, so setting up and maintaining pipelines, frontend devs updating websites to add the service, and whatever else is necessary. There is more work involved on the production side, but it can be split up to multiple engineers.

dnautics · on April 29, 2020

That's exactly what I suspected (I work at a dl/ml hardware co), thanks!

Jaruzel · on April 29, 2020

Managing data is not an IT job. Data is just unformatted information, and should be managed and governed by those who are trained in Information Management: Modern day Librarians.

IT own the platform, and the software. They should never own the data as well.

alexilliamson · on April 29, 2020

I agree with this, and I think data librarian is a role that any "data-driven" company needs. IMO it makes a lot of sense for data scientists to fill that role, but I think that's an issue for many. Data scientists may think being a librarian and organizing the knowledge base is beneath them, or maybe management thinks it's beneath them. Execs tend to not care about the state of knowledge infrastructure as long as their reports get to them when they expect.

akarve · on April 28, 2020

This is close to home. Our approach to DevOps for ML Data is to use S3 as the git core and build immutable datasets and models on top of S3 object versioning. I wrote the piece below on "Versioning data and models for faster iteration in ML" earlier this year. The key idea is for every model iteration to be a pure function F(code, environment, data). Ideas welcome: https://medium.com/pytorch/how-to-iterate-faster-in-machine-...

Tehchops · on April 28, 2020

I'm reminded of: https://blog.acolyer.org/2019/06/03/ease-ml-ci/

iddan · on April 28, 2020

This startup is trying to build the next GitHub for ML Data: https://dagshub.com/

gunshai · on April 28, 2020

This seems pretty cool.

beckingz · on April 28, 2020

Data is hard to automate and standardized pipelines and processes are really helpful. This is interesting.

tkyjonathan · on April 29, 2020

Isn't this DataOps?

schnitsel · on May 7, 2020

I was thinking the same, it could be that the OP isn't familiar with the term yet.

pottertheotter · on April 28, 2020

Is this not just a really long advertisement?

flaxton · on April 29, 2020

Rule number one: define your terms as you introduce them. On and on about ML. But what is it?

I had to search to see it was Machine Learning.

How hard is it to define it the first time you use it?

I can bet lots of people were scratching their heads but didn’t bother to look it up or continue reading...

oplav · on April 29, 2020

Genuinely curious, did you think "ML" stood for anything else? My day to day work is not machine learning but if I ever see ML, "machine learning" is the first thing I think of.

Tommah · on April 29, 2020

There is also the ML family of programming languages: https://en.wikipedia.org/wiki/ML_(programming_language)

kevinstumpf · on April 29, 2020

Thanks for flagging! Fixed.