google-app-engine


Running a web crawler for selected sites on Google App Engine?


I need to write a crawler to extract some information from a few pre-selected websites only.
I know this is a straightforward job, but I am thinking of using Google App Engine to get it done.
Maybe I can use Nutch to do this for me.
How feasible is this approach?
1) Hosting a crawler on Google infrastructure
2) Nutch + App Engine: will it be possible?
Just glancing over the Nutch docs, I see comments like "[t]his is the second release of Nutch based entirely on the underlying Hadoop platform",
which make me suspect it will not run on App Engine. App Engine apps run in a Python or Java sandbox.
That said, you should be able to put a basic crawler together on App Engine. A basic implementation would probably involve launching tasks that use urlfetch to grab pages and then, optionally, insert additional tasks to process the links each document contains. You can kick the crawl off using scheduled tasks.
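The fetch-then-enqueue loop described above can be sketched in plain Python. This is only an illustration under stated assumptions, not App Engine code: the in-memory `deque` stands in for the task queue, the `fetch` callable stands in for urlfetch, and the names `crawl`, `extract_links`, `allowed_hosts`, and `max_pages` are hypothetical. On App Engine, each loop iteration would instead be its own task.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute URLs (without #fragments) linked from the page."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def crawl(seed_urls, allowed_hosts, fetch, max_pages=100):
    """Crawl only the pre-selected hosts, starting from seed_urls.

    fetch(url) is assumed to return the page HTML as a string, or None
    on failure; it is a stand-in for whatever HTTP call is available.
    """
    queue = deque(seed_urls)  # stand-in for a task queue
    seen = set(seed_urls)     # avoids enqueueing the same URL twice
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        # "Insert additional tasks" for links that stay on allowed hosts.
        for link in extract_links(url, html):
            if urlparse(link).hostname in allowed_hosts and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing the fetcher in as a parameter keeps the sketch testable without network access. The `allowed_hosts` check is what restricts the crawl to the selected sites, and `max_pages` is a simple guard against an unbounded crawl.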

