Google App Engine: efficient large deletes (about 90000/day)
I have an application that has only one Model with two StringProperties.
The initial number of entities is around 100 million (I will upload those with the bulk loader).
Every 24 hours I must remove about 70000 entities and add 100000 entities. My question is now: what is the best way of deleting those entities?
Is there anyway to avoid fetching the entity before deleting it? I was unable to find a way of doing something like:
DELETE from xxx WHERE foo1 IN ('bar1', 'bar2', 'bar3', ...)
I realize that app engine offers an IN clause (albeit with a maximum length of 30 (because of the maximum number of individual requests per GQL query 1)), but to me that still seems strange because I will have to get the x entities and then delete them again (making two RPC calls per entity).
Note: the entity should be ignored if not found.
EDIT: Added info about problem
These entities are simply domains. The first string being the SLD and the second the TLD (no subdomains). The application can be used to preform a request like this http://[...]/available/stackoverflow.com . The application will return a True/False json object.
Why do I have so many entities? Because the datastore contains all registered domains (.com for now). I cannot perform a whois request in every case because of TOSs and latency. So I initially populate the datastore with an entire zone file and then daily add/remove the domains that have been registered/dropped... The problem is, that these are pretty big quantities and I have to figure out a way to keep costs down and add/remove 2*~100000 domains per day.
Note: there is hardly any computation going on as an availability request simply checks whether the domain exists in the datastore!
1: ' A maximum of 30 datastore queries are allowed for any single GQL query.' (http://code.google.com/appengine/docs/python/datastore/gqlreference.html)
If are not doing so already you should be using key_names
for this.
You'll want a model something like:
class UnavailableDomain(db.Model):
pass
Then you will populate your datastore like:
UnavailableDomain.get_or_insert(key_name='stackoverflow.com')
UnavailableDomain.get_or_insert(key_name='google.com')
Then you will query for available domains with something like:
is_available = UnavailableDomain.get_by_key_name('stackoverflow.com') is None
Then when you need to remove a bunch of domains because they have become available, you can build a big list of keys without having to query the database first like:
free_domains = ['stackoverflow.com', 'monkey.com']
db.delete(db.Key.from_path('UnavailableDomain', name) for name in free_domains)
I would still recommend batching up the deletes into something like 200 per RPC, if your free_domains list is really big
have you considered the appengine-mapreduce library. It comes with the pipeline library and you could utilise both to:
上一篇: GAE中的高端RPC调用是否正常?