Tuesday, July 5, 2011

Protocol Buffers & Avro


I consume a service from the world's biggest marketplace for my site's inventory. The current SLA for their services is ~200 ms. I was stunned to hear that their new SLA for us is ~20 ms. I heard the improvement came from optimizing data serialization, so I dug into it a little today to get up to speed.

Most client-server platforms serialize data into a leaner format before sending it and deserialize it on the receiving end. Many languages offer native serialization APIs, but when data is serialized with a native API, metadata about the class is written to the output too. Is it possible to serialize only the data, without the metadata?
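To make the metadata overhead concrete, here is a small sketch in Python. The `Item` class and its fields are made up for illustration; `pickle` stands in for a "native serialization API" and `struct` for a data-only format where both sides agree on the layout beforehand.

```python
import pickle
import struct
from dataclasses import dataclass


@dataclass
class Item:  # hypothetical inventory record, not from any real API
    item_id: int
    price_cents: int
    quantity: int


item = Item(item_id=1001, price_cents=2599, quantity=3)

# Native serialization: the payload carries the class name and the
# attribute names alongside the values.
native = pickle.dumps(item)

# Data-only serialization: sender and receiver agree on the layout in
# advance (three little-endian unsigned 32-bit integers), so only the
# values travel over the wire.
data_only = struct.pack("<III", item.item_id, item.price_cents, item.quantity)

print("native bytes:   ", len(native))
print("data-only bytes:", len(data_only))
print("class name in native payload: ", b"Item" in native)
print("field name in native payload: ", b"price_cents" in native)
```

The data-only payload is 12 bytes, while the pickled one is several times larger because it has to describe the class as well as the values. The exact pickled size depends on the Python version.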

Yes, it is: Google Protocol Buffers & Apache Avro

Google Protocol Buffers
Protocol Buffers is a serialization format with an interface description language, developed by Google. It is available under a free, open source license. Its design goals emphasize performance and simplicity. It is a language-neutral, platform-neutral, extensible mechanism for serializing structured data.

You define how you want your data to be structured in .proto files, which are simply structured text files. Once you have decided on the structure, you run the protocol buffer compiler (protoc) on the .proto file, and it generates a class for your language (Adobe ActionScript 3, Java, C, C++, Python). Because the class can be generated for several languages, the client and the server can each get one in their own technology, which secures a type-safe data contract between the two. Protocol Buffers also lets you update the data structure without breaking deployed programs that were compiled against the old format.
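The "update without breaking old programs" part is easier to see with a toy example. The sketch below is not the real Protocol Buffers library. It is a hand-rolled imitation, in Python, of the idea that every value on the wire is tagged with a field number, so an old reader can skip numbers it does not know. It only handles non-negative integers, and the field names are invented.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data, pos):
    """Return (value, next_position) for the varint starting at pos."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_fields(fields):
    """Encode {field_number: int} as key/value varint pairs (wire type 0)."""
    out = bytearray()
    for number, value in sorted(fields.items()):
        out += encode_varint(number << 3 | 0)  # key = field number + wire type
        out += encode_varint(value)
    return bytes(out)


def decode_known(data, known):
    """Decode only the field numbers listed in `known`; skip the rest."""
    result = {}
    pos = 0
    while pos < len(data):
        key, pos = decode_varint(data, pos)
        value, pos = decode_varint(data, pos)
        number = key >> 3
        if number in known:
            result[known[number]] = value
    return result


# The old program only knows fields 1 and 2; field 3 was added later.
OLD_SCHEMA = {1: "item_id", 2: "quantity"}
new_message = encode_fields({1: 1001, 2: 3, 3: 2599})
print(decode_known(new_message, OLD_SCHEMA))  # {'item_id': 1001, 'quantity': 3}
```

The old reader gets the two fields it understands and silently ignores the third, which is why adding a field does not break it. In real projects the generated classes do this for you.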

The Protocol Buffers documentation quotes around 100 to 200 nanoseconds to parse a small example message. Because the overhead of the data structure is not needed, only the values of the object's fields are serialized. Protocol Buffers uses a compact encoding for each data type (always primitives) and serializes only the fields that are set (not null).
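As a rough illustration of "compact encoding" and "only fields that are set", here is a simplified Python sketch that mimics the way Protocol Buffers lays out integers and strings on the wire. It is my own toy encoder, not the official library, and it only supports non-negative integers and strings. The field numbers and values are made up.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_message(fields):
    """fields: list of (field_number, value); value is int, str or None."""
    out = bytearray()
    for number, value in fields:
        if value is None:
            continue  # unset fields produce no bytes at all
        if isinstance(value, int):
            out += encode_varint(number << 3 | 0)  # wire type 0: varint
            out += encode_varint(value)
        else:
            raw = value.encode("utf-8")
            out += encode_varint(number << 3 | 2)  # wire type 2: length-delimited
            out += encode_varint(len(raw))
            out += raw
    return bytes(out)


full = encode_message([(1, 150), (2, "book"), (3, 7)])
sparse = encode_message([(1, 150), (2, None), (3, None)])

print(full.hex(), len(full))      # 0896011204626f6f6b1807 11
print(sparse.hex(), len(sparse))  # 089601 3
```

Small numbers take one or two bytes instead of a fixed four or eight, and the message with two unset fields shrinks to three bytes because nothing is written for them.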

Apache Avro
Avro is another very recent serialization system. It provides rich data structures that are compact, and are transported in a binary data format.

Avro relies on a schema-based system that defines the data contract to be exchanged. When Avro data is read, the schema used when writing it is always present. As with Protocol Buffers, only the values in the data structure are serialized and sent. The strategy employed by Avro (and Protocol Buffers) means that a minimal amount of data is generated, enabling fast transport.

The schemas are the equivalent of Protocol Buffers' .proto files, but no code has to be generated from them. The data structures are declared in JSON.
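Here is a small Python sketch of what that looks like. The schema is ordinary JSON, read at runtime with no generated classes. The encoder is a hand-written imitation of Avro's binary encoding for just two types (`long` and `string`), not the official Avro library, and the `Item` record is invented for the example.

```python
import json

# A hypothetical Avro record schema, declared in plain JSON.
SCHEMA = json.loads("""
{
  "type": "record",
  "name": "Item",
  "fields": [
    {"name": "item_id", "type": "long"},
    {"name": "title",   "type": "string"}
  ]
}
""")


def encode_long(value):
    """Avro long: zigzag-encode, then write as a variable-length integer."""
    zigzag = (value << 1) ^ (value >> 63)  # valid for 64-bit signed values
    out = bytearray()
    while True:
        byte = zigzag & 0x7F
        zigzag >>= 7
        if zigzag:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string(value):
    """Avro string: byte length as a long, followed by the UTF-8 bytes."""
    raw = value.encode("utf-8")
    return encode_long(len(raw)) + raw


ENCODERS = {"long": encode_long, "string": encode_string}


def encode_record(schema, record):
    """Concatenate field values in schema order; no names or tags are written."""
    return b"".join(
        ENCODERS[field["type"]](record[field["name"]])
        for field in schema["fields"]
    )


payload = encode_record(SCHEMA, {"item_id": 1001, "title": "book"})
print(payload.hex(), len(payload))  # d20f08626f6f6b 7
```

The payload contains no field names or field numbers at all, only the values in schema order. That is why the reader needs the writer's schema to make sense of the bytes.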

Protocol Buffers & Avro

Google’s Protocol Buffers provides a much richer API for defining a data contract than Avro. Below is a list of features available in Protocol Buffers and not in Avro:


  • Declare nested types

  • Define required, repeated and optional fields

  • Specify default values on fields

  • Declare enumerations and set a field's default value from one

  • Multiple message types in the same document

  • Import other proto files

  • Declare a range of field numbers in a message that is available for third-party extensions (extensions and nested extensions)

  • Define services

As far as I could find, Avro supports only C, Java and Python, while Protocol Buffers is available for C, C++, Adobe ActionScript 3, Java and Python.


By now you will have guessed what that company used in their middleware as the push mechanism between server and client.



2 comments:

Zen said...

Awesome topic, terms are little new to me but worth a read to enrich my technical stuff..keep posting such topics very useful for lot people out here....

Sarav said...

sure Senthil