Top 1% Smart Contracts Generated 96% of all Ethereum Logs — an Analysis via Bash Scripting

Ethereum logs often stump newcomers. A typical Ethereum log looks like this:

$ curl https://stream.0xfast.com/beta/eth-mainnet/15000000 -s | jq '.transactions[].logs'

...

{
  "address": "0x9e0905249ceefffb9605e034b534544684a58be6",
  "topics": [
    "0xc42079f94a6350d7e6235f29174924f928cc2ac818eb64fed8004e115fbcca67",
    "0x00000000000000000000000068b3465833fb72a70ecdf485e0e4c7bd8665fc45",
    "0x0000000000000000000000004c0ae29009867ff7240b730c85eef24be0d8f97d"
  ],
  "data": "0xffffffffffffffffffffffffffffffffffffffffffffffffffffff12f5a8fd8400000000000000000000000000000000000000000000000004db7325476300000000000000000000000000000000000000000249c4616d93729ac0ce8ab2522b00000000000000000000000000000000000000000000000007a8eea51883cadc000000000000000000000000000000000000000000000000000000000001f1e8",
  "blockNumber": "0xe4e1c0",
  "transactionIndex": "0x158",
  "logIndex": "0x2cd",
  "@type": "Log"
}

Looking at this log, it all seems strange. Address, topics, data — what are all these things? There are plenty of posts online to explain what these fields are, so I won’t go too much deeper into it. But, to summarize:

  • The address is that of the Smart Contract that emitted the log while the transaction executed.
  • The topic at index 0 is the Keccak-256 hash of the signature of the event that was emitted (for non-anonymous events).
  • The topics at index 1 and onwards are the “indexed” arguments of that event. They are indexed to make searching over logs easier.
  • Data contains the unindexed arguments of that event. They can only be retrieved, not searched over.

So, essentially, what a log is telling us is that a contract at a particular address emitted a particular event, along with the arguments that were passed to it.
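Two of these fields can be read without any ABI. The sketch below uses two hypothetical helpers (the names are mine, not part of any tool): one turns a hex quantity such as blockNumber into decimal, the other pulls an address out of an indexed topic, assuming the topic holds an address left-padded to 32 bytes.

```shell
# Convert a 0x-prefixed hex quantity (blockNumber, logIndex, ...) to decimal.
hex_to_dec() {
  echo "$(( $1 ))"
}

# An indexed address argument is left-padded with zeros to 32 bytes,
# so the address itself is the last 40 hex characters of the topic.
topic_to_address() {
  printf '0x%s\n' "$(printf '%s' "$1" | tail -c 40)"
}

hex_to_dec 0xe4e1c0
# prints 15000000, the block requested in the curl call above

topic_to_address 0x00000000000000000000000068b3465833fb72a70ecdf485e0e4c7bd8665fc45
# prints 0x68b3465833fb72a70ecdf485e0e4c7bd8665fc45
```

Whether a given topic really is an address (rather than, say, a number) is exactly what the ABI tells you, which is why the rest of the decoding cannot be done blind.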

Decoding this however requires access to the ABI — the Application Binary Interface — a play on the more broadly known API — Application Programming Interface. The idea is the same. To understand what methods are possible, and what arguments they accept, you need to have a defined list. That defined list is stored in an ABI, which can be downloaded in a JSON format.
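To make that concrete, here is a rough sketch of how an ABI file relates to topic 0. It assumes the standard JSON ABI layout (an array of entries with "type", "name" and "inputs") and uses a hypothetical helper name, event_signatures. It only builds the signature strings; hashing them with Keccak-256 needs an external tool and is left out.

```shell
# Hypothetical helper: read a JSON ABI on stdin and print the canonical
# signature of every event it declares, one per line.
# Caveat: struct ("tuple") arguments would need their components expanded,
# which this sketch does not attempt.
event_signatures() {
  jq -r '.[] | select(.type == "event")
         | "\(.name)(\(.inputs | map(.type) | join(",")))"'
}
```

For example, feeding it the ERC-20 Transfer entry, `[{"type":"event","name":"Transfer","inputs":[{"indexed":true,"name":"from","type":"address"},{"indexed":true,"name":"to","type":"address"},{"indexed":false,"name":"value","type":"uint256"}]}]`, prints `Transfer(address,address,uint256)`. The Keccak-256 hash of that string is what shows up as topic 0 for every Transfer log.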

Motivation

I recently launched 0xFast Stream, which makes it easier for anyone to stream Ethereum blocks, transactions, and logs. But, logs are in this encoded, hard-to-understand format. It would make sense to decode the logs into the event signature and the arguments.

But to parse a log, you need its contract’s ABI. The problem is, there are millions of smart contracts deployed on Ethereum Mainnet — and new ones are being generated on a daily basis. So, which ones do you download?

Intuitively, I knew that some smart contracts would be invoked way more than others. So, if I generate a list of smart contracts which cover, say, 90% of the logs, that should give a great starting point.

Essentially, I needed a distribution of smart contract -> number of logs. To generate this distribution, I turned back to the Stream service.

Map Reduce Over Ethereum Blocks

I wrote a bash script, which can download a range of blocks via 0xFast Stream, which returns the data in a JSON array. I then pipe that through jq, ensuring that it accounts for the fact that some blocks won’t have any transactions, and many transactions won’t have any logs. I then output the address field in the log, which is the smart contract address.
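The jq step can also be written more compactly. The following is a sketch of an equivalent filter, not the one used for this analysis; it assumes the same JSON shape (an array of blocks, each with an optional transactions array, each transaction with an optional logs array). The helper name log_addresses is made up for illustration.

```shell
# Hypothetical helper: read an array of blocks on stdin and print the
# address of every log. The `?` suffix makes jq skip missing or null
# arrays, so blocks without transactions and transactions without logs
# yield nothing instead of raising an error.
log_addresses() {
  jq '.[] | .transactions[]? | .logs[]? | .address'
}
```

Like the script's filter, this prints each address as a quoted JSON string, so the counts it feeds into stay comparable.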

The addresses are then piped through sort | uniq -c and written out to a file corresponding to that block range.

#!/bin/bash

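j=$1                  # last block of the range, e.g. 15300000
let i=$j-999          # first block; i..j is inclusive, so 1000 blocks per call
f=`printf %08d $j`    # zero-padded file name, so the .cnt files sort by block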

curl "https://stream.0xfast.com/beta/eth-mainnet/$i-$j" -s | jq '.[] | select (.) as $blk | if ($blk.transactions | length > 0) then ($blk.transactions[] | select (.) as $txn | if ($txn.logs | length > 0) then ($txn.logs[].address) else empty end) else empty end' | sort | uniq -c > $f.cnt

echo "$i-$j -> $f.cnt | `date`"
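To show what ends up in each .cnt file, here is the counting step on its own, run on a few made-up addresses. The helper name count_addresses is hypothetical, and the extra awk only strips the padding that uniq -c adds, so the output is easy to compare.

```shell
# Count how often each address appears on stdin.
# uniq -c only merges adjacent duplicates, hence the sort first.
count_addresses() {
  sort | uniq -c | awk '{ print $1, $2 }'
}

printf '%s\n' '"0xaa"' '"0xbb"' '"0xaa"' '"0xaa"' | count_addresses
# prints:
# 3 "0xaa"
# 1 "0xbb"
```

Each line is a count followed by the quoted address, which is the two-column layout the later merge step relies on.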

The Stream service is fast enough that a single thread would download data faster than it could process it. So, I needed to invoke the script concurrently to maximize CPU usage. For that, I brought out the trusty GNU parallel.

To set parallel up, I needed an input file it can read the arguments from. I have 15.3M ETH blocks downloaded in the service (bringing it to the latest is a work in progress). I started from there and counted backward down to block 1, covering a range of 1000 blocks in a single call.

for i in `seq 15300000 -1000 1`; do echo $i >> input.txt; done
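As a sanity check on the arithmetic, this small sketch reproduces the range each input line maps to and counts the lines. The helper name block_range is made up; it mirrors the i = j - 999 calculation in the script.

```shell
# Given the last block j of a call, print the inclusive range it covers.
block_range() {
  j=$1
  i=$((j - 999))
  echo "$i-$j"
}

block_range 15300000   # prints 15299001-15300000 (first line of input.txt)
block_range 1000       # prints 1-1000 (last line of input.txt)

# Number of lines in input.txt, i.e. number of script invocations.
seq 15300000 -1000 1 | wc -l   # 15300
```

So the ranges tile blocks 1 through 15,300,000 with no gaps or overlaps, in 15,300 calls.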

Now that the input file was ready, I fired off parallel like so:

parallel -a input.txt -j24 ./script.sh {}

My server has 32 CPUs. Allowing for the other tasks, I gave this job 24 CPUs to churn this data.

The script finished in under 2 hours, leaving me with 15.3K “.cnt” files, each containing the number of logs per smart contract address for its block range.

While the MAP phase of this job was done, the REDUCE part wasn’t done yet. I had to merge the output from all these files into one. For that, I used awk to generate the final count and then ran it through sort to sort by the smart contracts with the most logs first.

awk '{ count[$2] += $1 } END { for(elem in count) print count[elem], elem }' *.cnt > final.txt
cat final.txt| sort -nrk1 > sorted.txt
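Here is the same reduce step on two tiny made-up .cnt files, wrapped in a hypothetical function merge_counts so it is easy to try out. It is only a sketch of the technique: sum the counts per address across files, then sort by total, largest first.

```shell
# Sum the per-file counts for each address, then sort by total, descending.
merge_counts() {
  awk '{ count[$2] += $1 } END { for (addr in count) print count[addr], addr }' "$@" |
    sort -nrk1
}

tmp=$(mktemp -d)
printf '3 "0xaa"\n1 "0xbb"\n' > "$tmp/00001000.cnt"
printf '2 "0xaa"\n7 "0xcc"\n' > "$tmp/00002000.cnt"

merge_counts "$tmp"/*.cnt
# prints:
# 7 "0xcc"
# 5 "0xaa"
# 1 "0xbb"

rm -r "$tmp"
```

The awk array holds one entry per distinct address, so memory grows with the number of contracts, not with the number of logs.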

Ok. Get ready for the results!

Smart Contract Facts to Learn

  1. 6.3M smart contracts have emitted at least one log on the Ethereum mainnet (through block 15.3M).
  2. Through block 15.3M, mainnet has ~4.3B logs.
  3. The top 11 smart contracts cumulatively generated over a billion logs. (Not quite the top 10: those add up to about 986M, but I’m sure they have crossed a billion since block 15.3M.)
  4. To cover 95% of all logs (4.07B), you just need 42K smart contracts, of the 6.3M total.
  5. To cover 90% of all logs (3.85B), you just need 19K smart contracts.
  6. The top smart contract has 386M logs.
  7. The 100th smart contract from the top has 3.5M logs, only 0.9% as many as the top one.
  8. The top 1% of smart contracts (63K) generated 4.1B logs, ~96.6% of all logs (Approved by Bernie).

$ cat sorted.txt| wc -l
6296090 # 1

$ cat sorted.txt| awk '{a+=$1}END{print a}'
4281396796 # 2

$ cat sorted.txt| head -11 | awk '{a+=$1}END{print a}'
1005895644 # 3

$ cat sorted.txt| awk '{a+=$1; if (a < 4067326956) print $0}' | wc -l
42381 # 4

$ cat sorted.txt| awk '{a+=$1; if (a < 3853257116) print $0}' | wc -l
19233 # 5

$ cat sorted.txt| head -1
386328174 "0xc02aaa39b223fe8d0a0e5c4f27ead9083c756cc2" # 6

$ cat sorted.txt| head -100 | tail -1
3508294 "0xa15c7ebe1f07caf6bff097d8a589fb8ac49ae5b3" # 7

$ cat sorted.txt| head -63000 | awk '{a+=$1}END{print a}'
4134680762 # 8
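The coverage numbers above use hard-coded thresholds. A more general version could look like the sketch below, which assumes a file sorted by count, largest first, in the same two-column layout as sorted.txt. The function name contracts_for_coverage is hypothetical. Note that it counts the contract that crosses the threshold, so it can report one more than the `a <` commands above.

```shell
# Print how many of the top contracts are needed to reach a given
# fraction of all logs. $1 = fraction (e.g. 0.95), $2 = sorted file.
# The file is read twice: once for the total, once for the running sum.
contracts_for_coverage() {
  awk -v frac="$1" '
    NR == FNR { total += $1; next }    # first pass: total log count
    { cum += $1; n++                   # second pass: running sum
      if (cum >= frac * total) { print n; exit } }
  ' "$2" "$2"
}

tmp=$(mktemp)
printf '50 "0xaa"\n30 "0xbb"\n15 "0xcc"\n5 "0xdd"\n' > "$tmp"
contracts_for_coverage 0.9 "$tmp"   # prints 3 (50 + 30 + 15 = 95 of 100)
rm "$tmp"
```

Run against the real file, `contracts_for_coverage 0.95 sorted.txt` would answer fact 4 without first computing 95% of the total by hand.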

Top 10 Most Actively Used Smart Contracts

And these are the top 10 most active smart contracts, along with a count of how many logs they generated.

$ cat sorted.txt| head -10
386328174 "0xc02aaa39b223fe8d0a0e5c4f27ead9083c756cc2"
314839126 "0xdac17f958d2ee523a2206206994597c13d831ec7"
101212190 "0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48"
34224540 "0x6b175474e89094c44da98b954eedeac495271d0f"
31471320 "0x7be8076f4ea4a4ad08075c2508e481d6c946d12b"
30207624 "0x514910771af9ca656af840dff83e8264ecf986ca"
24645864 "0x5acc84a3e955bdd76467d3348077d003f00ffb97"
21741294 "0x8a91c9a16cd62693649d80afa85a09dbbdcb8508"
21658938 "0x8d12a197cb00d4747a1fe03395095ce2a5cc6819"
20040086 "0x95ad61b0a150d79219dcf64e1e6cc01f0b64c4ce"

Next Steps

With this information, I can download ~40K smart contract ABIs from Etherscan. That would allow me to parse 95% of the logs.
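Getting the download list out of sorted.txt is a one-liner. This sketch uses a hypothetical helper, top_addresses, and assumes the two-column "count, quoted address" layout shown above. It stops at producing bare addresses; the actual ABI download is not shown here.

```shell
# Print the addresses of the top N contracts, without the JSON quotes.
# $1 = N, $2 = file sorted by log count (largest first).
top_addresses() {
  head -n "$1" "$2" | awk '{ gsub(/"/, "", $2); print $2 }'
}
```

For example, `top_addresses 42000 sorted.txt > addresses.txt` would write one address per line, ready to feed into whichever ABI downloader you use.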

To begin with, my team downloaded the ABIs for the top 350 smart contracts from this list, and we have placed them here: https://github.com/outcaste-io/abis

Soon, we’ll have more of these smart contracts available, so the 0xFast Stream service can use them to parse the logs and stream readable logs — logs that everyone can understand. Stay tuned for that launch!

Meanwhile, if you’d like your smart contract to be included, please send us a PR so we can add it to the repo.


Special thanks to Ashwin, Balaji, Karthic, Nithin and Tushar for helping me with this project.



Date
October 2, 2022