When you plan to scale your service or application, don't forget about how your logs will grow. More traffic means more log events, and that costs CPU to write and ship them and disk to hold them. If you wrote your own agent to ship your logs, don't forget to scale up the logging service it feeds, too.
Source: a peak traffic event. The person who did the back-of-the-envelope math based on p99 request size forgot about logs, and specifically the increased log event volume.
Machines were about to tip over. I recall having to SSH into hosts to manually kill the log agent process because the logging ingestion service was shitting itself (and not throttling properly either) and we had no other levers in place. Lolz, did I mention deleting logs on live hosts because they were just accumulating and never getting cleared off? Now imagine this across 20,000 hosts. Teehee.
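For what it's worth, here is roughly the log math that got skipped, as a minimal Python sketch. Everything in it is an assumption for illustration: the `LogLoad` fields, the helper names, and every number except the 20,000 hosts are made up, so plug in your own peak figures.

```python
from dataclasses import dataclass


@dataclass
class LogLoad:
    """Hypothetical per-host log load at peak traffic."""
    requests_per_sec: float    # peak requests handled by one host
    events_per_request: float  # log events emitted per request
    bytes_per_event: float     # average size of one log event on disk

    def bytes_per_sec(self) -> float:
        # Log bytes written per second on a single host.
        return self.requests_per_sec * self.events_per_request * self.bytes_per_event


def fleet_ingest_bytes_per_sec(load: LogLoad, hosts: int) -> float:
    # What the ingestion service must absorb if every host ships everything.
    return load.bytes_per_sec() * hosts


def seconds_until_disk_full(load: LogLoad, free_disk_bytes: float,
                            ship_bytes_per_sec: float) -> float:
    # How long a host survives when logs are written faster than the agent
    # ships and clears them. Returns infinity if the agent keeps up.
    backlog_rate = load.bytes_per_sec() - ship_bytes_per_sec
    if backlog_rate <= 0:
        return float("inf")
    return free_disk_bytes / backlog_rate


# Made-up peak numbers; only the host count comes from the story above.
peak = LogLoad(requests_per_sec=500, events_per_request=8, bytes_per_event=400)
hosts = 20_000

print(f"per host: {peak.bytes_per_sec() / 1e6:.1f} MB/s of logs")
print(f"fleet:    {fleet_ingest_bytes_per_sec(peak, hosts) / 1e9:.1f} GB/s into ingestion")

# Worst case: ingestion is down, so the agent ships nothing.
stalled = seconds_until_disk_full(peak, free_disk_bytes=50e9, ship_bytes_per_sec=0)
print(f"disk full in about {stalled / 3600:.1f} hours if nothing is shipped")
```

The point is that request size alone tells you nothing here: the fleet-wide number is what your ingestion service has to survive, and the time-to-full number is how long you have before someone is deleting logs by hand.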