Calculating unique elements from huge list in Google App Engine
I have a web widget with 15,000,000 hits per month, and I log every session. When I generate a report, I'd like to know how many unique IPs there are. In normal SQL that would be easy, as I'd just run: SELECT COUNT(*) FROM (SELECT DISTINCT IP FROM SESSIONS). But as that's not possible with App Engine, I'm now looking into how to do it. It doesn't need to be fast. One solution I was thinking of is to start with an empty Unique-IP table, then run a MapReduce job over all session entities: if an entity's IP is not in the table, I add it and add one to a counter. Another MapReduce job would then clear the table. Would this be crazy? If so, how would you do it? Thanks!
The mapreduce approach you suggest is exactly what you want. Don't forget to use transactions when you update the record in your task queue task; that is what allows you to run it in parallel with many mappers. In the future, reduce support will make this possible with a single straightforward mapreduce, with no hacking around with your own transactions and models.
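To make the "check, insert, and count inside a transaction" step concrete, here is a minimal sketch in plain Python. It does not use the App Engine APIs: `FakeStore` is a hypothetical in-memory stand-in for the datastore, its lock plays the role of a transaction, and threads stand in for parallel mappers. All names here are assumptions for illustration only.

```python
import threading


class FakeStore:
    """Hypothetical in-memory stand-in for the datastore.

    The lock plays the role of a transaction: the existence check,
    the insert, and the counter update happen as one atomic step.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self.unique_ips = set()  # stands in for the Unique-IP table, keyed by IP
        self.counter = 0         # stands in for the counter entity

    def record_ip(self, ip):
        """Add ip if unseen and bump the counter. Returns True if it was new."""
        with self._lock:
            if ip in self.unique_ips:
                return False
            self.unique_ips.add(ip)
            self.counter += 1
            return True


def map_session(store, session):
    """Mapper body: called once per session entity."""
    store.record_ip(session["ip"])


def count_unique_ips(sessions, workers=4):
    """Run several 'mappers' in parallel over disjoint slices of the sessions."""
    store = FakeStore()

    def run_mapper(chunk):
        for session in chunk:
            map_session(store, session)

    threads = [
        threading.Thread(target=run_mapper, args=(sessions[i::workers],))
        for i in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store.counter
```

Without the atomic check-and-insert, two mappers that see the same new IP at the same time could both increment the counter, which is why the transaction matters once the job runs in parallel.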
If time is not important, you could try the task queue with a task limit of 1. Basically you'd use a recursive task that queries through batches of log records until it hits DeadlineExceededError. Then you'd write the results to the datastore, and the task would enqueue itself with the query end cursor (or the last record's key value) so the next run resumes the fetch where the previous one stopped.
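The control flow of such a self-re-enqueuing task can be sketched in plain Python. This is a simulation, not App Engine code: a `deque` stands in for the task queue, a list index stands in for the query cursor, the `state` dict stands in for what would be written to the datastore, and a fixed per-run batch budget (`BUDGET_PER_RUN`, a made-up value) stands in for hitting the request deadline.

```python
from collections import deque

BATCH_SIZE = 3      # records fetched per query (hypothetical value)
BUDGET_PER_RUN = 2  # batches one task run may process; stands in for the deadline


def run_task(logs, state, queue, cursor):
    """One task run: process batches from `cursor`, save progress, re-enqueue if unfinished."""
    batches_done = 0
    while cursor < len(logs) and batches_done < BUDGET_PER_RUN:
        batch = logs[cursor:cursor + BATCH_SIZE]
        state["seen"].update(record["ip"] for record in batch)
        cursor += len(batch)
        batches_done += 1
    # "Write the results to the datastore": persist progress before the run ends.
    state["cursor"] = cursor
    if cursor < len(logs):
        # The task enqueues itself, passing along where the fetch stopped.
        queue.append(cursor)


def count_unique(logs):
    """Drive the queue one task at a time; return (unique IP count, number of task runs)."""
    state = {"seen": set(), "cursor": 0}
    queue = deque([0])
    runs = 0
    while queue:
        run_task(logs, state, queue, queue.popleft())
        runs += 1
    return len(state["seen"]), runs
```

Because only one task runs at a time, no transaction is needed here; the trade-off is that the whole scan is sequential, which fits the "doesn't need to be fast" requirement.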