Designing Data-Intensive Applications

We start our deep dive into Joe’s favorite new book, Designing Data-Intensive Applications, as Joe can’t be stopped while running downhill, Michael might have a new spin on #fartgate, and Allen doesn’t quite have a dozen tips this episode.

If you’re reading this via your podcast player, you can always go to https://www.codingblocks.net/episode120 to read these show notes on a larger screen and participate in the conversation.

News

Thank you to those that took time out of their day to leave us a review:
- Stitcher: Anonymous, jeoffman
How to get started with a SQL Server database using Docker:
- SQL Server Tips – Run in Docker and an Amazing SSMS Tip (YouTube)
- Sample SQL Server Database and Docker (YouTube)
Come see Allen at NDC { London } for your chance to kick him in the shins, where he will be giving his talk Big Data Analytics in Near-Real-Time with Apache Kafka Streams. (ndc-london.com)
John Deere – Customer Showcase: Perform Real-time ETL from IoT Devices into your Data Lake with Amazon Kinesis (YouTube)
There’s a new SSD sheriff in town and it’s the Seagate Firecuda 520 with a reported maximum 5,000 MB/s sequential reads and 4,400 MB/s sequential writes!!! (Amazon)
- Seagate Firecuda 520 1TB NVMe PCIe Gen4 M.2 SSD Review (TweakTown)
Get 40% off your Pluralsight subscription! (Pluralsight)
Joe was a guest on The Waffling Taylors, episode 59. (wafflingtaylors.rocks)

Designing Data-Intensive Applications

About this book

What is a data-intensive application per the book?

Any application whose primary challenge is:

The quantity of data.
The complexity of the data.
The speed at which the data is changing.

That’s in contrast to applications that are compute intensive.

Buzzwords that seem to be synonymous with data-intensive

NoSQL
Message queues
Caches
Search indexes
Batch / stream processing

This book is …

This book is NOT a tutorial on building data-intensive applications with a particular toolset, nor is it pure theory.

What the book IS:

A study of successful data systems.
A look into the tools / technologies that enable these data-intensive systems to perform, scale, and be reliable in production environments.
An examination of their algorithms and the trade-offs they made.

Why read this book?

The goal is that by going through this, you will be able to understand what’s available and why you would use various methods, algorithms, and technologies.

While this book is geared towards software engineers/architects and their managers, it will especially appeal to those that:

Want to learn how to make systems scalable.
Need to learn how to make applications highly available.
Want to learn how to make systems easier to maintain.
Are just curious how these things work.

“[B]uilding for scale that you don’t need is wasted effort and may lock you into an inflexible design.”
Martin Kleppmann

While building for scale can be a form of premature optimization, it is still important to choose the right tool for the job. Knowing these tools, along with their strengths and weaknesses, helps you make better informed decisions.

Most of the book covers what is known as “big data” but the author doesn’t like that term for good reason: “Big data” is too vague. Big data to one person is small data to someone else.

Instead, the book talks in terms of single-node versus distributed systems.

The book is also heavily biased towards FOSS (Free and Open Source Software) because it’s possible to dig in and see what’s actually going on.

Are we living in the golden age of data?

There are an insane number of really good database systems.
The cloud has made things easy for a while, but tools like K8s, Docker, and Serverless are making things even easier.
Tons of machine learning services take care of common use cases and lower the barrier to entry: NLP, STT (speech-to-text), and sentiment analysis, for example.

Reliability

Most applications today are data-intensive rather than compute-intensive.
- CPUs are usually not the bottleneck in modern-day applications – size, complexity, and the dynamic nature of data are.
Most of these applications have similar needs:
- Store data so the application can find it again later.
- Cache expensive operations so they’ll be faster next time.
- Allow user searches.
- Send messages to other processes – stream processing.
- Process chunks of data at intervals – batch processing.
Designing data-intensive applications involves answering a lot of questions:
- How do you ensure the data is correct and complete even when something goes wrong?
- How do you provide good performance to clients even as parts of your system are struggling?
- How do you scale to handle an increase in load?
- How do you create a good API?
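
As a small illustration of the “cache expensive operations” need listed above, here is a minimal Python sketch using the standard library’s functools.lru_cache. The function name expensive_report and the calls counter are hypothetical and exist only for this example; a real system would more likely use a shared cache than an in-process one.

```python
from functools import lru_cache

# Counts how often the slow path actually runs (illustration only).
calls = {"count": 0}


@lru_cache(maxsize=128)
def expensive_report(customer_id: int) -> str:
    """Hypothetical stand-in for a slow query or computation."""
    calls["count"] += 1
    return f"report for customer {customer_id}"


expensive_report(42)  # computed
expensive_report(42)  # served from the cache, the function body is skipped
expensive_report(7)   # computed

print(calls["count"])  # 2
```

The trade-off is the usual one for caches: the second call is faster, but the cached answer can go stale if the underlying data changes.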

While reading this book, think about the systems that you use: How do they rate in terms of reliability, scalability, and maintainability?

What does it mean for your application to be reliable?

The application performs as expected.
The application can tolerate a user mistake or misuse.
The performance is good enough for the expected use case.
The system prevents any unauthorized access or abuse.

So in short – the application works correctly even when things go wrong.

When things go wrong, they’re called “faults”.

Systems that are designed to work even when there are faults are called fault-tolerant or resilient.

Faults are NOT the same as failures: a fault is one component of the system deviating from its spec, whereas a failure is the system as a whole no longer providing the required service to the user.

The goal is to reduce the possibility of a fault causing a failure.
It may be beneficial to introduce or ramp up the number of faults thrown at a system to make sure the system can handle them properly.
- You’re basically continually testing your resiliency.
- Netflix’s Chaos Monkey is an example.
The book prefers tolerating faults over preventing them (except in cases like security, where a breach can’t be undone), and is mostly aimed at building systems that are self-healing or curable.
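
To make the fault vs. failure distinction and the idea of deliberately injecting faults a bit more concrete, here is a minimal Python sketch. It is not how Chaos Monkey works; the names with_faults, call_with_retries, and lookup_price are hypothetical, and faults are injected on a fixed schedule (every Nth call) so the behavior is repeatable.

```python
import itertools


class InjectedFault(Exception):
    """Raised on purpose to simulate a component deviating from its spec."""


def with_faults(func, fail_every):
    """Wrap func so that every `fail_every`-th call raises an injected fault."""
    counter = itertools.count(1)

    def wrapper(*args, **kwargs):
        if next(counter) % fail_every == 0:
            raise InjectedFault("injected fault")
        return func(*args, **kwargs)

    return wrapper


def call_with_retries(func, attempts, *args, **kwargs):
    """Tolerate faults by retrying; after `attempts` tries it becomes a failure."""
    last_error = None
    for _ in range(attempts):
        try:
            return func(*args, **kwargs)
        except InjectedFault as error:
            last_error = error
    raise RuntimeError("service unavailable") from last_error


def lookup_price(item):
    """Hypothetical service call."""
    return {"apple": 3, "pear": 5}[item]


# Every third call faults, but a single retry hides that from the caller.
flaky = with_faults(lookup_price, fail_every=3)
results = [call_with_retries(flaky, 2, "apple") for _ in range(10)]
print(results)  # [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]

# Every call faults: retries run out and the fault turns into a failure.
broken = with_faults(lookup_price, fail_every=1)
try:
    call_with_retries(broken, 3, "apple")
    outcome = "ok"
except RuntimeError:
    outcome = "failure"
print(outcome)  # failure
```

Running the injected faults continually, as in the first case, is what gives you ongoing evidence that the retry path really works.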

Hardware Faults

Typically, hardware faults are addressed by adding redundancy:

Dual power supplies,
RAID configuration,
Hot swappable CPUs or other components.

As time has marched on, single-machine resiliency has been deprioritized in favor of elasticity, i.e. the ability to add or remove machines as needed. As a result, systems are now being built to tolerate the loss of entire machines.
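
One way to picture tolerating the loss of a machine is a read that falls back to another replica. The Python sketch below is only an illustration under simplifying assumptions: Node, NodeDown, and read_from_any are hypothetical names, every node holds a full copy of the data, and replication itself is not modeled.

```python
class NodeDown(Exception):
    """Signals that a machine is unreachable."""


class Node:
    """Hypothetical storage node holding a full copy of the data."""

    def __init__(self, name, data):
        self.name = name
        self.data = dict(data)
        self.alive = True

    def get(self, key):
        if not self.alive:
            raise NodeDown(self.name)
        return self.data[key]


def read_from_any(nodes, key):
    """Return the value from the first reachable replica."""
    for node in nodes:
        try:
            return node.get(key)
        except NodeDown:
            continue  # this machine is lost; try the next replica
    raise RuntimeError("all replicas are down")


data = {"user:1": "Ada"}
cluster = [Node("a", data), Node("b", data), Node("c", data)]

# Lose two of the three machines; the read still succeeds.
cluster[0].alive = False
cluster[1].alive = False
value = read_from_any(cluster, "user:1")
print(value)  # Ada
```

Only when every replica is gone does the fault become a failure that the caller sees.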

Software Errors

Software errors are usually triggered by some unusual event that was not planned for, and they can be more difficult to track down than hardware errors. Examples include:

Runaway processes that use up shared resources,
Services that slow down,
Cascading failures, where a fault in one component triggers faults in other components.

Human Errors

Humans can be the least reliable part of any system. So, how can we make systems reliable in spite of our best efforts to crash them?

Good UIs, APIs, etc.
Create fully featured sandbox environments where people can explore safely.
Test thoroughly.
Allow for fast rollbacks in case of problems.
Amazing monitoring.
Good training / practices.

How important is reliability?

Obviously there are some situations where reliability is super important (e.g. nuclear power plant).
Other times we might choose to sacrifice reliability to reduce cost.
But most importantly, make sure that’s a conscious choice!

Resources We Like

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann (Amazon)
Stefan Scherer has a Docker image for all your Windows needs (hub.docker.com)
Aspectacular with Vlad Hrybok – You down with AOP? (episode 9)
Grokking the System Design Interview (Educative.io)
- Designing a URL Shortening service like TinyURL (Educative.io)
Looking for inspiration for your battlestation? Check out r/battlestations! (Reddit)
Chaos Monkey by Netflix (GitHub)
Pokemon Sword and Shield Are Crashing Roku Devices (GameRant)
#140 The Roman Mars Mazda Virus by Reply All (Gimlet Media)
Troubleshoot .NET apps with auto-correlated traces and logs (Datadog)
Your hard drives and noise:
- Shouting at disks causes latency (Reddit, YouTube)
- A Loud Sound Just Shut Down a Bank’s Data Center for 10 Hours (Vice)
- Beware: Loud Noise Can Cause Data Loss on Hard Drives (Ontrack)
Backblaze Hard Drive Stats Q2 2019 (Backblaze)
Failure Trends in a Large Disk Drive Population (static.googleusercontent.com)

Tip of the Week

When using NUnit and parameterized tests, you should prefer IEnumerable<TestCaseData> over something like IEnumerable<MyFancyObject> because TestCaseData includes methods like .Explicit(), giving you more control over each test case parameter. (NUnit wiki on GitHub)
Not sure which DB engine meets your needs? Check out db-engines.com.
Search your command history with CTRL+R in Cmder or Terminal (and possibly other shells). Continue pressing CTRL+R to “scroll” through the history that matches your current search.
Forgo the KVM. Use Mouse without Borders instead. (Microsoft)
- Synergy – The Mac OS X version that Michael was referring to. Also works across Windows and Linux. (Synergy)
usql – A universal command-line interface for PostgreSQL, MySQL, Oracle, SQLite3, SQL Server, and others. (GitHub)

Survey Says

What is the single most important piece of your battlestation?