Podcast: Play in new window | Download
Subscribe: Apple Podcasts | Spotify | TuneIn | RSS
We continue to study the teachings of Designing Data-Intensive Applications, while Michael’s favorite book series might be the Twilight series, Joe blames his squeak toy chewing habit on his dogs, and Allen might be a Belieber.
Head over to https://www.codingblocks.net/episode121 to find this episode’s full show notes and join in the conversation, in case you’re reading these show notes via your podcast player.
Sponsors
- Educative.io – Level up your coding skills, quickly and efficiently. Visit educative.io/codingblocks to get 20% off any course or, for a limited time, get 50% off an annual subscription.
Survey Says
News
- So many new reviews to be thankful for! Thanks for making it a point to share your review with us:
- iTunes: bobby_richard, SeanNeedsNewGlasses, Teshiwyn, vasul007, The Jin John
- Stitcher: HotReloadJalapeño, Leonieke, Anonymous, Juke0815
- Joe was a guest on The Waffling Taylors. Check out the entire three episode series: Squidge The Ring, Jay vs Doors, and Exploding Horses.
- Beware of “free” VPNs.
- Allen will be speaking at NDC { London } where he will be giving his talk Big Data Analytics in Near-Real-Time with Apache Kafka Streams. Be sure to stop by for your chance to kick him in the shins! (ndc-london.com)
Scalability
- Increased load is a common reason for degradation of reliability.
- Scalability is the term used to describe a system’s ability to cope with increased load.
- It can be tempting to say X scales, or doesn’t, but calling a system “scalable” really describes how good your options for coping with growth are.
- A couple of questions to ask of your system:
- If the system grows in a particular way (more data, more users, more usage), what are our options for coping?
- How easy is it to add computing resources?
Describing Load
- We need to figure out how to describe load before we can truly answer any questions.
- “Load Parameters” are the metrics that make sense to measure for your system.
- Load Parameters are a measure of stress, not of performance.
- Different parameters may matter more for your system.
- Examples for particular uses may include:
- Web Server: requests per second
- Database: read/write ratio
- Application: # of simultaneous active users
- Cache: hit/miss ratio
- The load parameter may not be a single number, either. Sometimes you care more about the average, sometimes only about the extremes.
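To make these load parameters concrete, here’s a minimal sketch of deriving a few of them from a request log. This is our own illustration, not something from the book, and the log fields (ts, op, user, cache) are invented for the example.

```python
# Sketch: computing a few load parameters from a hypothetical request log.
from collections import Counter

log = [
    {"ts": 0.1, "op": "read",  "user": "a", "cache": "hit"},
    {"ts": 0.4, "op": "write", "user": "b", "cache": "miss"},
    {"ts": 0.9, "op": "read",  "user": "a", "cache": "hit"},
    {"ts": 1.2, "op": "read",  "user": "c", "cache": "miss"},
]

# Web server: requests per second over the observed window.
window_seconds = log[-1]["ts"] - log[0]["ts"]
requests_per_second = len(log) / window_seconds

# Database: read/write ratio.
ops = Counter(entry["op"] for entry in log)
read_write_ratio = ops["read"] / ops["write"]

# Application: number of distinct active users.
active_users = len({entry["user"] for entry in log})

# Cache: hit/miss ratio.
cache = Counter(entry["cache"] for entry in log)
hit_ratio = cache["hit"] / (cache["hit"] + cache["miss"])

print(f"{requests_per_second:.1f} req/s, read:write {read_write_ratio:.1f}, "
      f"{active_users} active users, cache hit ratio {hit_ratio:.0%}")
```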
Describing Performance
- Two ways to look at describing performance:
- How does performance change when you increase a load parameter, without changing resources?
- How much do you need to increase resources to maintain performance when increasing a load parameter?
- Performance numbers measure how well your system is responding.
- Examples can include:
- Throughput (records processed per second)
- Response time
Latency vs Response Time
- Latency is how long a request waits before it is handled, i.e. the time it spends awaiting service.
- Response time is the total time it takes for the client to see the response: the actual service time plus network and queueing delays. The sketch below illustrates the distinction.
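Here is a minimal sketch, with entirely made-up numbers, of a single server draining a FIFO queue: latency is the time each request spends waiting for the server to free up, and response time is that wait plus the service time.

```python
# Sketch: latency (queue wait) vs. service time vs. response time.
import random

random.seed(42)
server_free_at = 0.0  # single server, FIFO queue

for i in range(5):
    arrival = i * 0.01                    # requests arrive every 10 ms
    service_time = random.uniform(0.005, 0.03)
    start = max(arrival, server_free_at)  # wait if the server is busy
    latency = start - arrival             # time spent queued
    response_time = latency + service_time
    server_free_at = start + service_time
    print(f"req {i}: latency {latency*1000:5.1f} ms, "
          f"service {service_time*1000:5.1f} ms, "
          f"response {response_time*1000:5.1f} ms")
```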
What do you mean by “numbers”?
- Performance numbers are generally a set of numbers, e.g. minimum, maximum, average, median, percentiles.
- Sometimes the outliers are really important.
- For example, a game may have an average of 59 FPS but that number might drop to 10 FPS when garbage collection is running.
- Average may not be your best measure if you want to see the typical response time.
- For this reason it’s better to use percentiles.
- Consider the median: sort all the response times, and the one in the middle is the median. The median is also known as the 50th percentile, or P50.
- That means half of your response times will be under the median and half will be over.
- To find the outliers, you can look at the 95th, 99th, and 99.9th percentiles, aka P95, P99, and P999 (see the sketch at the end of this section).
- If you were to look at the P95 mark and the response time is 2s, then that means that 95% of all requests come back in under 2 seconds and 5% of all requests come back in over 2s.
- The example provided is that Amazon describes response time requirements at the P999 level, even though it only affects 1 in 1,000 requests. The reason? The slowest responses tend to belong to the customers who’ve made the most purchases. In other words, the most valued customers!
- Many large companies have measured the impact of increased response times on things like order completion and users abandoning the site.
- Ultimately, optimizing for P999 is very expensive and may not be worth the investment.
- Not only is it expensive, but it’s also difficult, because you may be fighting environmental factors outside of your control.
- Percentiles are often used in:
- SLOs – Service Level Objectives
- SLAs – Service Level Agreements
- Both are contracts that define the expected performance and availability of a service.
- Queuing delays are a big part of response times in the higher percentiles.
- Servers can only process a finite number of things in parallel; the rest of the requests are queued.
- A relatively small number of requests could be responsible for slowing many things down.
- Known as head-of-line blocking.
- For this reason, it’s important to measure response times client-side to make sure you’re getting the full picture.
- In load testing, the client application needs to keep issuing new requests even while it’s waiting for older ones, which more realistically mimics the real world.
- For applications that make multiple service calls to render a single screen or page, slow response times become critical: even if only 1 out of 20 backend requests is slow, the user still waits on it, so the slowest call typically determines the user experience.
- This compounding of slow requests is known as tail latency amplification.
- Monitoring response times can be very helpful, but also dangerous.
- If you’re trying to recompute the real response-time percentiles against an entire window of data every minute, the calculations can be very expensive.
- There are approximation algorithms that are much more efficient, such as forward decay, t-digest, or HdrHistogram.
- Averaging percentiles is meaningless; instead, you need to add the histograms, as the sketch below demonstrates.
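Here is a minimal sketch, entirely our own and with invented response times, that computes nearest-rank percentiles and shows why averaging per-server P95s gives a meaningless answer compared to taking the P95 of the combined samples (i.e. adding the histograms):

```python
# Sketch: percentiles from raw response times, and why you merge
# samples instead of averaging per-server percentiles.
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least
    p% of the data at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Invented response times (ms) from two hypothetical servers.
server_a = [12, 15, 14, 16, 13, 15, 14, 900]   # one slow outlier
server_b = [40, 42, 41, 43, 40, 44, 42, 41]

for name, times in (("A", server_a), ("B", server_b)):
    print(f"server {name}: P50={percentile(times, 50)} ms, "
          f"P95={percentile(times, 95)} ms")

# Meaningless: averaging the two per-server P95s.
avg_of_p95s = (percentile(server_a, 95) + percentile(server_b, 95)) / 2
# Meaningful: merge the underlying samples (add the histograms),
# then take the percentile of the combined data.
merged_p95 = percentile(server_a + server_b, 95)
print(f"average of P95s: {avg_of_p95s} ms vs P95 of merged data: {merged_p95} ms")
```

The averaged number (472 ms here) corresponds to no real behavior of the fleet, while the merged P95 (900 ms) is the actual fleet-wide P95. In production you’d reach for a purpose-built structure like HdrHistogram rather than sorting raw samples on every query.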
Coping with Load
How do we retain good performance when our load increases?
- An application that was designed for 1,000 concurrent users will likely not handle 10,000 concurrent users very well.
- It is often necessary to rethink your architecture each time load increases significantly, e.g. by an order of magnitude.
- The options, simplified:
- Scaling up – adding more hardware resources to a single machine to handle additional load.
- Scaling out – adding more machines to handle additional load.
- One is not necessarily better than the other.
- Scaling up can be much simpler and easier to maintain, but there is a limit to the power available in a single machine, and an uber-powerful single machine has serious cost ramifications.
- Scaling out can be much cheaper in hardware costs but cost more in developer time and maintenance to make sure everything is running as expected.
- Rather than picking one over the other, consider when each makes the most sense.
- Elasticity is when a system can dynamically resize itself based on load, e.g. adding more machines as necessary.
- Some of this happens automatically, based on some load-detecting criteria.
- Some of it happens manually, by a person.
- Scaling manually may be simpler and can protect against unexpected massive scaling that hits the wallet hard. (A sketch of a simple automatic scale-out rule follows this list.)
- For a long time, RDBMSs typically ran on a single machine. Even with a failover, the load wasn’t typically distributed.
- As distributed systems are becoming more common and the abstractions are being better built, things like this are changing, and likely to change even more.
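Referring back to the elasticity point above, here is a minimal sketch of the kind of rule an elastic system might apply. The function, thresholds, and capacities are all hypothetical, not any real autoscaler’s API.

```python
# Sketch: a hypothetical elasticity rule that scales out under high
# utilization and scales in when mostly idle, with a hard cap.
def desired_machines(current, requests_per_sec,
                     capacity_per_machine=1000, max_machines=20):
    """Return how many machines we'd like, given measured load."""
    utilization = requests_per_sec / (current * capacity_per_machine)
    if utilization > 0.8:                  # running hot: scale out
        return min(current + 1, max_machines)
    if utilization < 0.3 and current > 1:  # mostly idle: scale in
        return current - 1
    return current                         # within the comfort zone

print(desired_machines(current=4, requests_per_sec=3500))  # 5: scale out
print(desired_machines(current=4, requests_per_sec=1000))  # 3: scale in
```

The hard cap on machines is the code equivalent of the manual safeguard above: it bounds how badly an unexpected traffic spike can hit the wallet.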
“[T]here is no such thing as a generic, one-size-fits-all scalable architecture …”
Martin Kleppmann
- The scaling problem could be reads, writes, volume of data, complexity of the data, etc.
- A system that handles 100,000 requests per second at 1 KB each is very different from a system that handles 3 requests per minute at 2 GB each. Same data throughput (about 100 MB/s either way), but much different requirements.
- Designing a scalable system based off of bad assumptions can be both a waste of time and, even worse, counterproductive.
- In early stage applications, it’s often much more important to be able to iterate quickly on the application than to design for an unknown future load.
Resources We Like
- Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann (Amazon)
- Grokking the System Design Interview (Educative.io)
- HdrHistogram: A High Dynamic Range Histogram (hdrhistogram.org)
- Performance (Stack Exchange)
- AWS Snowmobile (aws.amazon.com)
- Monitoring Containerized Application Health with Docker (Pluralsight)
Tip of the Week
- Get back into Slack: instead of typing in your Slack workspace name, just click the “Find your workspace” link and enter your email – it’ll send you all of the workspaces linked to that email.
- Forget everything Michael said about CodeDOM in episode 112. Just use Roslyn. (GitHub)
- The holidays are the perfect time for Advent of Code. (adventofcode.com)