Cover Image for Google BigQuery Ethereum Dataset: A Comprehensive Tutorial​

Google BigQuery Ethereum Dataset: A Comprehensive Tutorial​

Data in the Cloud
Blockchain
Tutorial

If you want to query blockchain data, querying it from a data indexing service will be time-consuming. Most of the indexing APIs will have some kind of limit on them, so you have to query them again and again till you get all the data you need. What if there was indexed data stored on a SQL database that you could access on demand without any restrictions?

Various public and private datasets offer access to blockchain data. Here are a few notable ones:

  • BigQuery Public Cryptocurrency Datasets
  • AWS Public Blockchain Data
  • Bitquery’s Data in Cloud

In this tutorial, we are going to learn about Google BigQuery which provides us a way to query large amounts of data for public blockchains like Bitcoin, Ethereum, Polygon and many more with their public dataset.

Today we are going to learn how to query public Ethereum datasets and query data in the Google Cloud SQL workspace using SQL, and how to use the Python client for BigQuery to fetch data from those public datasets. We will also learn about the drawbacks of these public datasets and how to overcome them.

Prerequisites

As the title suggests, we are going to use Google products, so you need a Google account. For technical prerequisites, you should have basic knowledge of SQL, and if you’re going to continue with the Python part of the tutorial, you should have some knowledge of Python.

Setting Up Google BigQuery

If you're new to Google BigQuery, the first step is to set up your account and create a new project within Google BigQuery. You can initiate this process by visiting the following link: https://console.cloud.google.com/bigquery

Once your account setup is complete, you'll gain access to all the publicly available datasets on Google BigQuery. Additionally, you'll be able to use the built-in editor to write and execute queries, allowing you to explore the Ethereum public dataset.

Google BigQuery Dashboard

You can open a new tab by clicking on the blue plus button, which will display a new editor where you can write your SQL queries.

Accessing Blockchain Data via BigQuery Dashboard

Let's begin by examining the public Ethereum dataset. You can find detailed information about the dataset here.

Google Bigquery Public Ethereum Dataset Product Page Screenshot

Upon visiting the page, you'll notice that there's a free tier available for BigQuery, allowing us to explore the data. The page also provides some examples illustrating how the dataset can be utilized.

To dive deeper into the dataset, let's initiate our exploration by running our first query to discover the available tables within this dataset. Open your editor and paste the following query:

SELECT *
FROM `bigquery-public-data.crypto_ethereum.INFORMATION_SCHEMA.TABLES`;

In this query, we are accessing the metadata of the dataset to understand its structure using the dataset ID, which, in this case, is bigquery-public-data.crypto_ethereum. To access the metadata, we utilize INFORMATION_SCHEMA.TABLES, and by selecting all columns using *, we can view all the details.

To execute the query, simply click the "Run" button located at the top of the editor window, and you should see the following result:

https://lh6.googleusercontent.com/7f_eZfm1FKd8ITrsPFiwoKlUHBzslj5rpkIpnNCN57jsrfB3sxpNHKg4f-BM2A1Jx_2RPGh8Ao8cZgXRC0m-EQwTLomqDi0oNwbZpkAly_KV1r9na2v_sAVGEMMWJ30KLssc1oFIbu0WMlCjrcBdbcU

You'll be presented with a list of table names, including tokens, blocks, contracts, and more. In total, there are 11 tables available in this dataset, which are as follows:

  • tokens
  • blocks
  • contracts
  • traces
  • token_transfers
  • balances
  • transactions
  • sessions
  • logs
  • load_metadata
  • amended_tokens

In this tutorial, we’ll explore some of these tables by writing queries to query data from them, like query to get Ethereum balances for specific addresses and the latest transactions from a given address, etc. Let's start with the balances table.

Balances

Before diving into the Ethereum Balances table, it's important to familiarize ourselves with its schema to understand its structure. To query the schema, execute the following query:

SELECT
	column_name,
	data_type
FROM
	`bigquery-public-data.crypto_ethereum.INFORMATION_SCHEMA.COLUMNS`
WHERE
	table_name = 'balances';

This query will provide you with the following result:

https://lh6.googleusercontent.com/p3J_Gc3-JSDWMjNNzpxNH3lZqjSL6ts3ebPRSvTxVW95qPXr3QaC6J4VOAFYRPO8ZaM8tlGrnjxGTF9GIdiNKMFMcL75cY9Lu08AlBsvrkfcq_j5rQP9ZcR73MwVXLGT2mJoasM2YdjxVRFxtHm8kQw

In the schema, we have the 'address' field, which is of type STRING, and the eth_balance field, which is of type NUMERIC.

This dataset set is limited to the current balance history of the address and not the balance of ERC20 tokens, the balance history of native tokens, or ERC20 tokens. Let's write some queries to extract data related to addresses.

Get balance of the address

To query the balance of a specific address, we need to filter the data based on the address using the WHERE keyword:

SELECT address, eth_balance
FROM `bigquery-public-data.crypto_ethereum.balances`
WHERE address = '0xcda7559bcef42e68f16233b5b8c99c757a5f4697';

https://lh6.googleusercontent.com/6bxau_qkyvnjKvuxCBKiEUotcQm7MOwdrjQwKiME1Cqw9tgnakiu3itcYfL3UUxdtI-K3bMa1uiG_hBF6Dr0lzS1Gr5XAGVcUXA4KJtVPAkjOFwMRv1koMnoNGw18av4vXPqYxd4QQ7aQyLkjuc1sc0

Get Top 10 Addresses with Highest Balances

To get a list of the top 10 addresses with the highest balances, we simply need to order the eth_balance in descending order and limit the results to the first 10:

SELECT address, eth_balance
FROM `bigquery-public-data.crypto_ethereum.balances`
ORDER BY eth_balance DESC
LIMIT 10;

https://lh4.googleusercontent.com/x3IXyaKZTN6BWHq428tsOg7E4HK4rp6BQHQo-aP0tJFxGdzpM_hy24LojvzHULCPiu6Fv4IHVP-4Bz6YAHRyd9K-85nFR_dXorUsahNxN10NOxZOxXeV1rWI5CH0dl-j6mDbMsGYSqAu4WKemDyn5ic

Transactions

The Transactions table provides information about transactions that have occurred on the Ethereum blockchain. Before we dive into querying this dataset, let's explore its schema by running the following query:

SELECT
	column_name,
	data_type
FROM
	`bigquery-public-data.crypto_ethereum.INFORMATION_SCHEMA.COLUMNS`
WHERE
	table_name = 'transactions';

https://lh4.googleusercontent.com/ZLKlRRrmVQBfyXFv83avq7evwJ8tawz9L0GHhD9lXgqioCWJolvOgPWCbwk2YOGjfyhOkw9znKR-SLnUTnbjphhEikvUmHsh7-DHZ5u9TYqob2hmkMYPzwIfdk1IjFdFHHEsxk2GVkd5x3_nVIwVw48

This query will display a number of columns related to transactions. The dataset provides comprehensive information about transactions, but for internal transaction details, you can explore them separately. Let's delve into this dataset by writing some queries.

Get transactions by transaction hash

To query a transaction by its hash, we'll use the WHERE clause to specify the hash value:

SELECT *
FROM `bigquery-public-data.crypto_ethereum.transactions`
WHERE `hash` = "0x1f5abf832162265242a27b7f8bf6a6ec6b2f4a4fd6d9e681cab6924807636fd7"
	AND block_timestamp > TIMESTAMP '2023-09-06 00:00:00';

You'll notice a difference here compared to the previous queries. In the query above, you'll see that the term hash is enclosed in backticks (), and there's also a block_timestampvalue included. Why is this the case? Well,hashhappens to be a keyword in BigQuery, so to indicate thathash` is the name of the column, we enclose it in backticks.

Additionally, the block_timestamp filter is included to narrow down the data we're retrieving, making it more efficient by filtering through a smaller subset of the table rather than the entire table. This optimization improves the performance of our query.

After running this query, you will see the following result:

https://lh3.googleusercontent.com/uvmA-fLJ-vAnPODCefBmBEFLrQPd2gobBZ0RvUi8Cbqh5hamiYM8BuXfhz-BtB7m0Rjk99a9a17JJoYh8SXloilwABdCI7YUXsiVTC4jNoAySOPzNV07Rc6uVxnPzZ2WnY-YJiXsTOLoCnFPJ7HAHN8

Get latest transaction from an address

To get the latest transaction from a specific address, we'll check if either the from_address or to_address matches the target address and then order the results by the transaction timestamp:

SELECT *
FROM `bigquery-public-data.crypto_ethereum.transactions`
WHERE (from_address = '0xcda7559bcef42e68f16233b5b8c99c757a5f4697'
	OR to_address = '0xcda7559bcef42e68f16233b5b8c99c757a5f4697')
ORDER BY block_timestamp DESC
LIMIT 10;

Token Transfers

The Token Transfers table provides information on ERC20 transfers from the Ethereum blockchain. Along with query table schema via editor, you can also explore the schema in the BigQuery Dashboard using the explorer, which is on the right-hand side.

https://lh4.googleusercontent.com/AZu9NCzEJtwW3qccLcrvwGc1n7b0lX0_IPwxoawQcXFK977oMHZiWkF6w7A5kg7qJD3HY01pYmbcEaatIXDfvXhMAGCyKwxXCEo-bBQOuhk1uT-DbdI-QlA512TpYbNh_NnEWp04Nbnc--L1DNgcwMY

You can explore details of the dataset, ranging from the schema of the table to a preview of the data, among other details.

Get token transfers of address

To get token transfers for a specific address, the query is similar to fetching transactions for an address, but with a change in the queried table from transaction to token_transfers:

SELECT *
FROM `bigquery-public-data.crypto_ethereum.token_transfers`
WHERE (from_address = '0xcda7559bcef42e68f16233b5b8c99c757a5f4697'
	OR to_address = '0xcda7559bcef42e68f16233b5b8c99c757a5f4697')
ORDER BY block_timestamp DESC
LIMIT 10;

Get Latest USDT Transfers

To specifically get USDT transfers, we use WHERE clauses and set the token_address to “0xdAC17F958D2ee523a2206206994597C13D831ec7”.

SELECT *
FROM `bigquery-public-data.crypto_ethereum.token_transfers`
WHERE
	token_address = ‘0xdAC17F958D2ee523a2206206994597C13D831ec7’
	AND block_timestamp > TIMESTAMP '2023-09-04 00:00:00'
ORDER BY block_timestamp DESC
LIMIT 10;

We have explored some tables and written queries that show how to query data. However, you can also create more complex queries by combining multiple tables to extract more meaningful insights. Here are some example queries that you can explore:

Querying Blockchain Data Using Python

While you can query data using SQL in the editor, you also have the option to explore the data in a Python notebook. The dashboard provides you with the tool to automatically generate a Python notebook for the query you have executed. To create one, click on "Explore Data" and select the "Explore with Python Notebook" option.

https://lh5.googleusercontent.com/4E2KvjgQJxXPf0G-hft2hJNpvnSRyAaX-6HUWRJr8aU-1hswc7eUJtkJWBr3k7wtNTAmMVlgGyIBQ3X4z-WWYsuiCesE11qPm8wIf0wCY8NiZCdRPzOUMMe-mlNILWht7jATS0zeusaGNUlBzgLymJs

This will provide you with a ready-made Python Notebook from which you can access data from the BigQuery Ethereum dataset.

Problems with using BigQuery for Blockchain Data

BigQuery makes it easy to access data with its public datasets, which are great for simple applications. But these datasets have some limits, so if you need a wider range of data, you might face problems.

If you want data through an API, you'd usually need to build your own systems to handle it. To help with this, Bitquery offers different products for different application needs.

Bitquery can give you indexed blockchain data via API, and you can also explore it using our IDE. There's another product called "Data in the Cloud," similar to BigQuery's public dataset, which we'll talk about in the next section.

Bitquery's data covers transactions, token transfers, DEX trades, block data, smart contract details like calls and events, and more. If you want to explore this data, our IDE uses GraphQL to make it easy, so you don't need to worry about complex SQL queries. It even has a query builder, which is helpful if you're not familiar with GraphQL or our API.

Data in the Cloud by Bitquery

Bitquery’s “Data in the Cloud” product provides an even richer dataset that covers the data BigQuery provides and much more.

Datasets provided with the “Data in the Cloud’ product provide comprehensive data coverage for many other things like smart contract calls, events, token transfers (including ERC20, ERC-721, and ERC-1155), DEX trades for 40+ protocols like Uniswap, Balancer, 1Inch, and many more.

In short, if you’re looking for something that provides comprehensive blockchain data via a cloud product, Data in the Cloud is the solution you’re looking for.

Check out demo datasets for the Data in the Cloud. If you want to talk to us, you can contact us here

Also Read

Subscribe to our newsletter

Subscribe and never miss any updates related to our APIs, new developments & latest news etc. Our newsletter is sent once a week on Monday.