Apache Avro is an open source data serialization system that lets you send information. It is frequently associated with “big data” and distributed systems because it has some distinct advantages over the competition.
The primary advantages are listed below, but read on for more information:
- Messages are highly efficient
- Strong schema support
- Schema versioning
- Dynamic schema support
- Different, but compatible, Reader and Writer versions allowed
- Union types
- Object container files can include schema along with encoded records
- Support for serialization/deserialization with human-readable JSON
Why not human readable formats?
There are many different ways of serializing and deserializing data, and they each have their own advantages. JSON is probably the most common format that developers deal with today, but XML is still king in some tech stacks. The prime benefit of these formats is that it easy for humans to read, making development and debugging much easier.
However, the same things that make formats like JSON and XML human readable (like labels for each element or attribute) are the same things that make it grossly inefficient.
Don’t believe me? Here’s a link to the Pokemon pokedex in json format. The labels (“id”, “name”, “Attack”, “Sp. Attack”) make up more data than the data does!
Pros:
- Human readable / writable
- Optional schema
- Widely supported
- Great for light tasks
- Schemas are optional
Cons:
- Computationally inefficient
- Schemas are optional
What about Protocol Buffers?
Formats like Protocol Buffers are not human readable, but they are dramatically more efficient. The idea is that you can create a schema that defines the shape of the data, and you can share that schema with anybody who needs to decode your data.
This allows us to dramatically reduce the amount of space per message. The schema also id based, which offers a limited kind of schema evolution where you can add or remove (as long as they aren’t required) new fields without breaking anything.
Pros:
- Highly efficient
- Most major languages have 3p library support for it
- Limited support for schema evolution
- Required schemas can serve as documentation
Cons:
- Schemas must be shared between serializers and deserializers
- Can’t add or remove required fields
- Schemas are not directly versioned
- Complicated for light tasks
Why Avro?
I listed all the reasons above, and I’ll list them again below, but ultimately it boils down to having awesome schema support.
Having a stand-alone, versioned schema allows Avro to keep the minimum amount of information in it’s messages – making them highly compact. The schema support also allows for different readers and writers to negotiate on which schemas they support, which bakes a lot of flexibility in at the lowest level.
There are some additional niceties that go along with this too, like making it easier to support dynamic schemas rather than compiling them in with your code and having to figure out how to share the schema across projects. Sure, you could build this on top of another format…and also build that support into all of the tools you support, but why do that when you could use a popular, agreed upon standard?
Pros:
- Messages are highly efficient
- Strong schema support
- Schema versioning
- Dynamic schema support
- Different, but compatible, Reader and Writer versions allowed
- Union types
- Object container files can include schema along with encoded records
- Support for serialization/deserialization with human-readable JSON
Cons:
- Support is not as wide as other formats
- Complicated for light tasks
Read a much better comparison here: https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
And then go buy the book!