SiriDB
The development of SiriDB
Jeroen, one of our software developers, started developing SiriDB and created its skeleton in December 2014. For our infrastructure monitoring system we needed a system that could make data such as memory, disk and CPU usage clearly visible. At that point we were using Google’s BigQuery. However, while BigQuery is good at processing large amounts of data, showing this data in graphs was not its strong point. We needed a database that could process large amounts of input data and was equally good at retrieving that data, with the possibility to aggregate it.
The proof of concept was developed during Christmas 2014, and we used Python for this version. After that we continued working on SiriDB for a year. During this year we developed our own query language and improved the scalability of the database. At the end of that year we had a version that could be scaled on the fly (multiple pools) and could store data in a robust way. We achieved this robustness by always having two servers in one pool. This gives extra performance when both servers are online, and when one is turned off the database still functions. This is useful because it allows us to install SiriDB updates on a running cluster without disrupting operations.
Now that we had this version, we started thinking about how we could further increase the performance and how to get the memory use down. Python adds considerable overhead in both performance and memory usage. Where speed was really important, for example in the aggregation functions, we had already written the code in native C. We then decided to write the entire database in C as well. After one year, we completed the first version of SiriDB written completely in C.
How does SiriDB work?
A SiriDB cluster consists of at least one server. When your database grows, you can set up a new server to expand your current database. When you create a new pool, SiriDB automatically distributes the existing data evenly over the two pools. This all happens in the background while the database keeps functioning. This process can be repeated, and SiriDB will redistribute its data each time. The algorithm works in such a way that data is never moved back to a 'previous' pool. When you go from two to three pools, each of the two existing pools moves a third of its data to the new pool.
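The redistribution property described above (data only ever moves to the newly added pool, never between existing pools) can be illustrated with jump consistent hashing. This is a stand-in technique chosen for illustration and not necessarily the algorithm SiriDB itself uses. The helper names `series_key` and `pool_for` and the series names are hypothetical.

```python
import hashlib


def series_key(name: str) -> int:
    """Map a series name to a 64-bit integer (hypothetical helper)."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def pool_for(name: str, num_pools: int) -> int:
    """Jump consistent hash: return a pool index in [0, num_pools).

    Assumption: this stands in for SiriDB's own lookup. It shares the
    property that growing from n to n + 1 pools only moves series into
    the new pool.
    """
    key = series_key(name)
    b, j = -1, 0
    while j < num_pools:
        b = j
        # 64-bit linear congruential step, as in the published algorithm.
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b


if __name__ == "__main__":
    names = [f"series-{i}" for i in range(3000)]
    moved = 0
    for name in names:
        before, after = pool_for(name, 2), pool_for(name, 3)
        # A series either stays where it was or moves to the new pool (index 2).
        assert after == before or after == 2
        moved += after != before
    print(f"moved to the new pool: {moved / len(names):.1%}")
```

With a scheme like this, roughly a third of all series move when a third pool is added, and no series is ever shuffled between the two existing pools.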
A SiriDB server only has knowledge of the series in its own pool. For any other series, a server only knows in which pool that series must be located if it exists. That way we can send queries and inserts efficiently to the correct server. In order to make SiriDB robust we can provide each pool with two servers. The moment you decide to add a second server to a pool, all required data will be synchronized in the background. When this process is finished, the new server is fully functional and will be used to answer queries and handle inserts.
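The routing idea can be sketched as follows: work out the pool for a series, then hand the insert to a server in that pool that is online. This is a simplified illustration, not SiriDB's implementation. The `Server` and `Pool` classes and the `pool_for` and `insert` functions are hypothetical, the CRC-based pool lookup is only a placeholder, and replication between the two servers as well as load sharing are left out.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    online: bool = True
    received: list = field(default_factory=list)  # points this server accepted


@dataclass
class Pool:
    servers: list  # one or two Server objects

    def pick(self) -> Server:
        """Return the first online server in this pool."""
        for server in self.servers:
            if server.online:
                return server
        raise RuntimeError("no server online in this pool")


def pool_for(series: str, num_pools: int) -> int:
    # Placeholder lookup (assumption): any deterministic mapping from
    # series name to pool index works for this sketch.
    return zlib.crc32(series.encode("utf-8")) % num_pools


def insert(pools: list, series: str, point: tuple) -> Server:
    """Route a point to an online server in the pool that owns the series."""
    pool = pools[pool_for(series, len(pools))]
    server = pool.pick()
    server.received.append((series, point))
    return server


if __name__ == "__main__":
    pools = [
        Pool([Server("pool0-a"), Server("pool0-b")]),
        Pool([Server("pool1-a"), Server("pool1-b")]),
    ]
    first = insert(pools, "cpu.usage", (1, 0.5))
    first.online = False  # e.g. taken down for an update
    second = insert(pools, "cpu.usage", (2, 0.6))
    print(first.name, "->", second.name)
```

Because every series maps to exactly one pool, an insert never has to be broadcast, and a pool with two servers keeps accepting data while one of them is offline.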
We parse queries with libcleri (a parser that we developed ourselves). Data is sent internally by using QPack (a self-developed message serializer).
Presentation at T-Dose
We’re happy to be sharing our knowledge about open source and time series databases with the open source community.
We’ll be sharing the decisions we faced during the development process, the pitfalls and the techniques we eventually settled on. We’ll break down some popular measurements for time series and see how they can be implemented in SiriDB. We'll also look at the various ways that SiriDB can be applied in different environments. A SiriDB datasource plugin for Grafana will also be explained and demonstrated.