Running a web crawler for selected sites on Google App Engine?
I need to write a crawler to extract some information from a few pre-selected websites only. I know this is a straightforward job, but I am thinking of using Google App Engine to get it done. Maybe I can use Nutch to do this for me. How feasible is this approach? 1) Hosting a crawler on Google infrastructure. 2) Nutch + App Engine: will it be possible?
Just glancing over the Nutch docs, I see comments like "[t]his is the second release of Nutch based entirely on the underlying Hadoop platform", which make me suspect it will not run on App Engine. App Engine apps run in a Python or Java sandbox. That said, you should be able to put a basic crawler together on App Engine. A basic implementation would probably involve launching tasks that use urlfetch to grab pages and then, optionally, insert additional tasks to fetch the pages that each document links to. You can kick the crawl off using scheduled tasks.
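To make the fetch-then-enqueue loop concrete, here is a minimal sketch in plain Python. It is not App Engine code: the `fetch` argument is a hypothetical stand-in for whatever does the HTTP request (urlfetch on App Engine), and the in-memory `deque` stands in for the task queue, where each queued URL would instead be its own task. The names `crawl`, `extract_links`, `allowed_hosts` and `max_pages` are my own, chosen only for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute URLs for all links found in html."""
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def crawl(seed_urls, allowed_hosts, fetch, max_pages=50):
    """Breadth-first crawl limited to allowed_hosts.

    fetch is any callable mapping a URL to HTML text (an assumption:
    on App Engine this role would be played by urlfetch).
    Returns a dict of url -> html for every page fetched.
    """
    queue = deque(seed_urls)   # stand-in for the task queue
    seen = set(seed_urls)      # avoids fetching the same URL twice
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue           # skip pages that fail to load
        pages[url] = html
        for link in extract_links(url, html):
            link = link.split("#", 1)[0]   # drop fragments
            if urlparse(link).netloc in allowed_hosts and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Restricting the crawl to `allowed_hosts` matches the "pre-selected websites only" requirement, and `max_pages` keeps a single run bounded. In a task-based version, the `seen` set would need to live somewhere shared (for example the datastore), since each task runs separately.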