
Streams have unknown size and may be infinite.

Batches have a known size and are not infinite.



Maybe I'm using the wrong definitions, but I think that's backwards.

Say you are receiving records from users at different intervals and you want to eventually store them in a different format in a database.

Streaming to me means you're "pushing" to the database according to some rule. For example, wait until 10 records have accumulated, then push them. This could take 1 minute or 10 hours, but you know the size of the dataset (exactly 10 records). (You could also add a max wait time, and then you'd be combining batching with streaming.)
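To make the "accumulate 10 records, then push" rule concrete, here is a rough sketch. The `PushBuffer` class, its parameters, and the `flush` callback are all made up for illustration; a real system would push to an actual database and read a real clock instead of taking `now` as an argument.

```python
from typing import Callable, List, Optional


class PushBuffer:
    """Hypothetical buffer: push after max_records, or after max_wait_s if set."""

    def __init__(
        self,
        flush: Callable[[List[dict]], None],
        max_records: int = 10,
        max_wait_s: Optional[float] = None,
    ) -> None:
        self.flush = flush              # stand-in for "write to the database"
        self.max_records = max_records
        self.max_wait_s = max_wait_s
        self._pending: List[dict] = []
        self._first_at = 0.0            # arrival time of the oldest pending record

    def add(self, record: dict, now: float) -> None:
        if not self._pending:
            self._first_at = now
        self._pending.append(record)
        # Size rule: every push contains exactly max_records records.
        if len(self._pending) >= self.max_records:
            self._push()

    def tick(self, now: float) -> None:
        # Optional time rule: push whatever is pending once it is old enough.
        if (
            self._pending
            and self.max_wait_s is not None
            and now - self._first_at >= self.max_wait_s
        ):
            self._push()

    def _push(self) -> None:
        batch, self._pending = self._pending, []
        self.flush(batch)
```

With only the size rule, every push is exactly 10 records no matter how long it takes to fill. Once the time rule is on, a push can be smaller than 10, which is the "combining batching with streaming" case.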

Batching to me means you're pulling from the database. For example, you pull once every hour. In that hour, you might get 0 records or 1000 records. You don't know the size, and it's potentially infinite.


It’s because you’re looking at it from opposing ends.

From the perspective of the data source, in a streaming context, the size is finite — it’s whatever you’re sending. From the data sink’s perspective, it’s unknown how many records are going to get sent in total.

Vice versa, in a batch context, the data source has no idea how many records will eventually be requested, but the data sink knows exactly the size of the request.

That is, whoever is initiating the job knows what’s up, and whoever is targeted just has to deal with it.

But generally I believe the norm is to discuss from the sink’s perspective, because the main interesting problem is when the sink has to deal with infinity (streaming). When the source deals with infinity (batch), it’s fairly straightforward to manage: refuse requests of too large a size and move on. The data isn’t going anywhere, so the sink can fix itself and re-request. You do that with streaming and data starts getting lost.
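A rough sketch of the "refuse requests of too large a size and move on" idea, from the batch side. `Source`, `RequestTooLarge`, and `pull_all` are hypothetical names, and halving the request size is just one way the sink could "fix itself"; the point is only that a refusal loses nothing because the data stays put at the source.

```python
from typing import List


class RequestTooLarge(Exception):
    """Raised by the (hypothetical) source when a pull asks for too much."""


class Source:
    def __init__(self, records: List[int], max_request: int = 100) -> None:
        self._records = records
        self.max_request = max_request

    def read(self, offset: int, limit: int) -> List[int]:
        # The source protects itself by refusing oversized requests.
        if limit > self.max_request:
            raise RequestTooLarge(f"limit {limit} > {self.max_request}")
        return self._records[offset:offset + limit]


def pull_all(source: Source, limit: int) -> List[int]:
    """Sink side: on refusal, shrink the request and ask again."""
    out: List[int] = []
    offset = 0
    while True:
        try:
            chunk = source.read(offset, limit)
        except RequestTooLarge:
            limit = max(1, limit // 2)  # the sink "fixes itself"
            continue                    # nothing was lost; re-request
        if not chunk:
            return out
        out.extend(chunk)
        offset += len(chunk)
```

There is no equivalent retry for a pushed stream unless something upstream buffers the records, which is where the data loss comes from.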


In part I think that is because the sink can run out of memory, whereas the store has already allocated enough memory.


I work with batch-oriented store-and-forward systems and they definitely push data in batches.



