Blockchain Data Warehousing: Challenges in Storing Blockchain Data

Managing and warehousing blockchain data from multiple chains is a complex task, fraught with unique challenges that can impact the stability and reliability of data management systems. In this blog, we'll discuss these challenges and see how teams like Bitquery tackle these issues.

So how do people store blockchain data?

To understand blockchain warehousing, we have to first understand the logistics of storing blockchain data on a smaller scale.

The key to maintaining a reliable record of blockchain transactions is running what are known as "nodes." While there are different types of nodes, our emphasis is on full nodes and archive nodes because they download chain data from the genesis block onwards.

At the time of writing, a Geth node takes up to 22 TiB (~24 TB) of storage space and is growing at a rate of 84 TiB per month.
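As a quick check of the units in the figure above, the conversion from binary units (TiB) to decimal units (TB) can be worked out in a few lines of Python. The helper name is ours and is only for illustration.

```python
TIB = 2**40   # bytes in one tebibyte (binary unit)
TB = 10**12   # bytes in one terabyte (decimal unit)


def tib_to_tb(tib: float) -> float:
    """Convert a size in tebibytes to terabytes."""
    return tib * TIB / TB


print(round(tib_to_tb(22), 1))  # 24.2
```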

Now imagine storing this data at scale for multiple blockchains (40+, in the case of blockchains Bitquery indexes) and serving the parsed/usable information in real-time. This brings a lot of problems regarding data at scale, which we will discuss below.

High Data Volume

Part of the storage burden in blockchain data warehousing comes from data duplication: the same information is stored redundantly across nodes. As more transactions are executed on the network, more data is created, and storage capacity has to grow with it. Demand for storage space can therefore rise sharply as the number of transactions scales up.

Preventing Server Overload: How to manage resources efficiently?

The sheer volume of data generated by blockchains can quickly fill up disk space, leading to servers reaching full capacity unexpectedly. When servers crash or reach full capacity, the difficulty in transferring massive amounts of data to new storage solutions can be considerable.

To avoid these issues, the use of distributed ledger technologies allows the replication of databases over numerous nodes, ensuring no data loss even if some nodes fail. However, this replication does not mitigate the need for ample disk space.

Beyond storage, another challenge in handling extensive blockchain data comes in the form of rate limits on APIs. Users may hit a rate limit and keep querying, which results in a poor experience when they are faced with "Too many simultaneous queries" errors. Bitquery's IDE is designed to handle potential API rate limits efficiently, providing users with a seamless experience.
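One common client-side way to cope with rate limits is to retry with exponential backoff instead of re-sending the query immediately. The sketch below is a generic illustration, not a description of how Bitquery's IDE works; `RateLimitError` and `call_with_backoff` are hypothetical names, and we assume the caller maps a "Too many simultaneous queries" response to that exception.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical error for a 'Too many simultaneous queries' response."""


def call_with_backoff(
    run_query: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry run_query on RateLimitError, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return run_query()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Exponential backoff plus a little jitter so that many
            # clients do not all retry at the same instant.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 10))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in normal use the default `time.sleep` is enough.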

Let's consider a couple of examples. First, a query that retrieves trades by an address executes in less than a second (approximately 600 ms). Achieving such speed requires a highly optimized and efficient query engine to fetch indexed data. Another example is a query that retrieves the balance of an address across multiple chains. This query consumes the following resources:

It's interesting to see how the CPU virtual time and SQL request count can impact the resources consumed by a query. As the number of records increases, for example when more chain databases or more tokens are queried, the demand for resources also increases.

ETL Processes and Reindexing Challenges

Sometimes it is necessary to reindex blockchain data. Finding a complete node with all the required data, or even enough free disk space, can be challenging, especially as older data becomes unavailable over time. The reindexing process itself is resource-intensive, often requiring substantial computational power and time.

The size of the database for a chain per server can vary depending on the specific server and the chain data being stored. Generally, larger chain databases, such as those for Solana, Tron, or Ethereum, require more resources and can significantly impact server performance.

We should focus on optimizing the database instance size and implementing an effective monitoring system. For example, when disk usage hits a high level, like 95%, a warning alert could be triggered. This alert could assist your operations team in managing storage effectively.
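A minimal sketch of such a disk-usage check, using only Python's standard library, could look like the following. The function names and message format are our own assumptions; the 95% threshold is the example level mentioned above.

```python
import shutil
from typing import Optional

WARN_THRESHOLD = 0.95  # the 95% level used as an example above


def usage_alert(used: int, total: int,
                threshold: float = WARN_THRESHOLD) -> Optional[str]:
    """Return a warning message when used/total reaches the threshold, else None."""
    fraction = used / total
    if fraction >= threshold:
        return f"WARNING: disk is {fraction:.0%} full"
    return None


def check_path(path: str = ".") -> Optional[str]:
    """Check the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage_alert(usage.used, usage.total)
```

In practice a check like `check_path("/data")` would run on a schedule, and any message it returns would be forwarded to the operations team.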

Scheduling fixed service downtimes is another strategy that could be beneficial. Predetermined start and end times can help manage resources efficiently, ensuring minimal disruption to the user experience.

Monitoring Challenges

When it comes to monitoring multiple blockchains, the key challenge lies in the diverse nature of the chains. Different blockchains may have varying performance metrics, consensus algorithms, and network loads.

It is important to set up dashboards that show when any node is lagging or has gone offline. This is an important step in ensuring the stability and reliability of data management systems, especially when dealing with multiple blockchains. We ensure that people using the Bitquery APIs don't face network timeouts and socket hang-ups by adding proper monitoring solutions for our blockchain nodes.
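To illustrate the idea, a node can be classified by comparing the timestamp of its latest block with the current time. This is a simplified sketch under our own assumptions: the per-chain tolerances below are made-up values, and a real setup would take them from each chain's block time.

```python
from typing import Optional

# Hypothetical lag tolerances in seconds; real values depend on the chain.
MAX_LAG_SECONDS = {"ethereum": 60.0, "solana": 10.0}


def node_status(chain: str, last_block_time: Optional[float], now: float) -> str:
    """Classify a node as 'offline', 'delayed' or 'healthy'.

    last_block_time is the Unix timestamp of the newest block the node
    reported, or None if the node did not answer at all.
    """
    if last_block_time is None:
        return "offline"
    lag = now - last_block_time
    return "delayed" if lag > MAX_LAG_SECONDS[chain] else "healthy"


print(node_status("ethereum", 1_000.0, 1_030.0))  # healthy
print(node_status("solana", 1_000.0, 1_030.0))    # delayed
```

The same 30-second lag is acceptable on one chain and a problem on another, which is why a single global threshold rarely works when monitoring many blockchains.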

The alerting system periodically queries data sources and evaluates the conditions defined in each alert rule. When a condition is met, notifications are sent to the contact points specified in the notification policy.
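The evaluation loop described above can be sketched as follows. This is a toy model, not the API of any particular monitoring product: `AlertRule`, `evaluate_rules` and the `notify` callback are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AlertRule:
    name: str
    query: Callable[[], float]          # fetches the current value from a data source
    condition: Callable[[float], bool]  # returns True when the alert should fire
    contact_points: List[str]           # who gets notified


def evaluate_rules(rules: List[AlertRule],
                   notify: Callable[[str, str], None]) -> List[str]:
    """Run one evaluation pass and return the names of the rules that fired."""
    fired = []
    for rule in rules:
        value = rule.query()
        if rule.condition(value):
            for contact in rule.contact_points:
                notify(contact, f"{rule.name}: value={value}")
            fired.append(rule.name)
    return fired
```

A scheduler would call `evaluate_rules` at a fixed interval; here `notify` could be anything that delivers a message, such as a function that posts to a chat channel.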

Mitigating Risks During Blockchain Upgrades and Forks

Node version upgrades and blockchain forks can also introduce significant challenges. Incompatible upgrades can cause errors and discrepancies in data, so these changes need careful planning and execution to maintain data integrity and continuity. Upgrades often require synchronization across multiple nodes to prevent data fragmentation or loss.

At Bitquery, we handle this by keeping track of new releases. Because we maintain our own forks of the node software, we update the code manually to keep our nodes up to date.

Data Quality Assurance

Maintaining quality has one important aspect: blockchain data is not static. Whenever a new transaction is added, it is indexed by blockchain data providers.

The growth rate of different blockchains can vary significantly, with some chains growing faster or slower than others. For instance, EOS currently has a growth rate of about 7 MiB per hour, while Celo and Flow are growing at rates of about 38 MiB per hour.
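To put these hourly rates in perspective, they can be projected to a year, or turned into an estimate of how long a given amount of free space will last. The sketch below assumes the growth rate stays constant, which real chains do not guarantee; the function names and the 500 GiB figure are illustrative.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365
MIB_PER_GIB = 1024


def gib_per_year(mib_per_hour: float) -> float:
    """Project an hourly growth rate in MiB to GiB per year (constant rate assumed)."""
    return mib_per_hour * HOURS_PER_DAY * DAYS_PER_YEAR / MIB_PER_GIB


def days_until_full(free_gib: float, mib_per_hour: float) -> float:
    """Estimate how many days the remaining free space lasts at a constant rate."""
    return free_gib * MIB_PER_GIB / (mib_per_hour * HOURS_PER_DAY)


print(round(gib_per_year(7), 1))          # 59.9  (EOS rate quoted above)
print(round(gib_per_year(38), 1))         # 325.1 (Celo/Flow rate quoted above)
print(round(days_until_full(500, 38)))    # 561 days for a hypothetical 500 GiB free
```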

It's important to keep track of these growth rates to ensure that data management systems are adequately equipped to handle the increasing amount of data being generated. One factor that could lead to a rise in storage sizes beyond the expected levels is the duplication of data. For instance, if one node stores the same transactional data multiple times, it amplifies the storage requirements to accommodate the duplicated data. This issue could escalate when the operations are executed at a large scale.
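As a simple illustration of guarding against duplicates, records can be filtered by a unique key before they are written. The sketch assumes each transaction is a dictionary with a `hash` field that uniquely identifies it; the field name and function are our own, and a production pipeline would more likely enforce this inside the database.

```python
from typing import Dict, Iterable, List


def dedupe_by_hash(transactions: Iterable[Dict]) -> List[Dict]:
    """Keep the first occurrence of each transaction hash, preserving order."""
    seen = set()
    unique = []
    for tx in transactions:
        if tx["hash"] not in seen:
            seen.add(tx["hash"])
            unique.append(tx)
    return unique


batch = [{"hash": "0xaa", "value": 1},
         {"hash": "0xbb", "value": 2},
         {"hash": "0xaa", "value": 1}]
print(len(dedupe_by_hash(batch)))  # 2
```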

Providing Access to the Data

Different databases support various query languages, such as SQL and GraphQL, which let users query the database. However, when it comes to blockchain data, different types of data such as transactions, trades, and blocks need to be parsed before they can be queried. At Bitquery, we support both SQL and GraphQL queries for blockchain data.

To ensure optimal performance and prevent system overloads, we've set parameters like:

  • Maximum Query Execution Time: This limit ensures that each query is executed within a reasonable time frame, maintaining system efficiency.
  • Max Query Rows: This cap prevents overwhelming the system with excessively large data returns.
  • Connection Timeout Failover: This timeout setting ensures a swift failover connection if the primary connection fails, providing a reliable and consistent user experience.
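The three limits above can be sketched in a few lines. Everything here is an assumption made for illustration: the default values, the class and function names, and the choice to raise an error (rather than truncate) when a limit is hit are ours, not Bitquery's actual configuration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class QueryLimits:
    max_execution_seconds: float = 60.0  # hypothetical default
    max_rows: int = 10_000               # hypothetical default


class QueryLimitExceeded(Exception):
    """Raised when a query runs too long or returns too many rows."""


def fetch_with_limits(rows: Iterable[T], limits: QueryLimits,
                      clock: Callable[[], float] = time.monotonic) -> List[T]:
    """Collect rows, stopping with an error once a limit is exceeded."""
    deadline = clock() + limits.max_execution_seconds
    result: List[T] = []
    for row in rows:
        if clock() > deadline:
            raise QueryLimitExceeded("maximum query execution time exceeded")
        if len(result) >= limits.max_rows:
            raise QueryLimitExceeded("maximum row count exceeded")
        result.append(row)
    return result


def connect_with_failover(connectors: Sequence[Callable[[float], T]],
                          timeout: float) -> T:
    """Try each connector in order, moving on when one times out."""
    last_error = None
    for connect in connectors:
        try:
            return connect(timeout)
        except TimeoutError as exc:
            last_error = exc  # remember the failure and try the next endpoint
    raise ConnectionError("all endpoints timed out") from last_error
```

Each connector is a callable that takes a timeout in seconds and either returns a connection or raises `TimeoutError`; the primary endpoint goes first in the list and the fallbacks follow.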

Conclusion

In this article, we looked at why blockchain data warehousing is a complex task that requires meticulous planning. Server issues, monitoring challenges, and keeping up with chain upgrades and releases are some of the challenges that affect the stability and reliability of your data management systems. By carefully managing resources, optimizing database size, monitoring nodes, and handling upgrades and releases, Bitquery ensures that users have access to high-quality multichain data with minimal downtime.


About Bitquery

Bitquery is your comprehensive toolkit designed with developers in mind, simplifying blockchain data access. Our products offer practical advantages and flexibility.

  • APIs - Explore API: Easily retrieve precise real-time and historical data for over 40 blockchains using GraphQL. Seamlessly integrate blockchain data into your applications, making data-driven decisions effortless.

  • Coinpath® - Try Coinpath: Streamline compliance and crypto investigations by tracing money movements across 40+ blockchains. Gain insights for efficient decision-making.

  • Data in Cloud - Try Demo Bucket: Access indexed blockchain data cost-effectively and at scale for your data pipeline. We currently support Ethereum, BSC, Solana, with more blockchains on the horizon, simplifying your data access.

  • Explorer - Try Explorer: Discover an intuitive platform for exploring data from 40+ blockchains. Visualize data, generate queries, and integrate effortlessly into your applications.

Bitquery empowers developers with straightforward blockchain data tools. If you have questions or need assistance, connect with us on our Telegram channel or via email at sales@bitquery.io. Stay updated on the latest in cryptocurrency by subscribing to our newsletter.