

Calculating unique elements from huge list in Google App Engine


I have a web widget that gets 15,000,000 hits per month, and I log every session. When I generate a report, I'd like to know how many unique IPs there are. In normal SQL that would be easy, as I'd just do:
SELECT COUNT(DISTINCT IP) FROM SESSIONS
But since that's not possible on App Engine, I'm looking into other ways to do it. It doesn't need to be fast.
One solution I was thinking of is to start with an empty Unique-IP table, then run a MapReduce job over all session entities: if an entity's IP is not in the table, I add it and increment a counter. A second MapReduce job would then clear the table. Would this be crazy? If so, how would you do it?
Thanks!
The mapreduce approach you suggest is exactly what you want. Don't forget to use a transaction when you update the record in your task queue task; that lets you run it in parallel with many mappers without counting the same IP twice.
In the future, reduce support will make this possible with a single straightforward mapreduce, with no hacking around with your own transactions and models.
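To make the transactional check-and-insert concrete, here is a minimal sketch in plain Python. It does not use the App Engine APIs: `UniqueIpTable`, `map_session` and `count_unique_ips` are hypothetical names, an in-memory set stands in for the Unique-IP table, a lock stands in for the datastore transaction, and a thread pool stands in for the parallel mappers.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class UniqueIpTable:
    """In-memory stand-in (assumption) for the Unique-IP table and its counter."""

    def __init__(self):
        self._lock = threading.Lock()  # plays the role of a datastore transaction
        self._seen = set()
        self.count = 0

    def add_if_absent(self, ip):
        """Atomically record ip; return True only if it was not seen before."""
        with self._lock:
            if ip in self._seen:
                return False
            self._seen.add(ip)
            self.count += 1
            return True

    def clear(self):
        """Reset the table, like the second job that clears it."""
        with self._lock:
            self._seen.clear()
            self.count = 0


def map_session(table, session):
    """Map step: called once per session entity."""
    table.add_if_absent(session["ip"])


def count_unique_ips(sessions, workers=8):
    """Run the map step over all sessions with several parallel workers."""
    table = UniqueIpTable()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda s: map_session(table, s), sessions))
    return table.count
```

The important part is that the membership check, the insert and the counter increment happen as one atomic step. If they were separate, two mappers seeing the same new IP at the same moment could both increment the counter.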
If time is not important, you could also try the task queue with a task limit of 1. Basically, you'd use a recursive task that queries through batches of log records until it hits DeadlineExceededError. It would then write its results to the datastore and enqueue itself again, passing along the query's end cursor (or the last record's key) so that the next run resumes the fetch where the previous one stopped.
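The resume-from-cursor loop can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not App Engine code: `fetch_batch`, `run_task` and `count_unique_ips_serially` are hypothetical names, a list index stands in for the query cursor, a `deque` stands in for the task queue, a dict stands in for the state saved to the datastore, and a per-run record budget replaces catching DeadlineExceededError.

```python
from collections import deque


def fetch_batch(records, cursor, batch_size):
    """Hypothetical stand-in for a datastore query resumed from a cursor."""
    batch = records[cursor:cursor + batch_size]
    return batch, cursor + len(batch)


def run_task(records, state, queue, batch_size=100, budget=250):
    """Process batches until the per-run budget is spent, then re-enqueue."""
    processed = 0
    cursor = state["cursor"]
    while processed < budget:  # assumption: replaces DeadlineExceededError
        batch, cursor = fetch_batch(records, cursor, batch_size)
        if not batch:
            state["done"] = True  # no records left, so do not re-enqueue
            break
        state["unique_ips"].update(r["ip"] for r in batch)
        processed += len(batch)
    state["cursor"] = cursor  # stands in for writing progress to the datastore
    if not state["done"]:
        queue.append(state)  # the task enqueues itself to continue later


def count_unique_ips_serially(records):
    """Drive the single-task queue until the recursive task finishes."""
    state = {"cursor": 0, "unique_ips": set(), "done": False}
    queue = deque([state])
    runs = 0
    while queue:
        run_task(records, queue.popleft(), queue)
        runs += 1
    return len(state["unique_ips"]), runs
```

Because only one task runs at a time, no transaction is needed here. The trade-off is that the whole job is serial, which matches the "time is not important" condition.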
