Google App Engine: efficient large deletes (about 90000/day)

I have an application that has only one Model with two StringProperties.

The initial number of entities is around 100 million (I will upload those with the bulk loader).

Every 24 hours I must remove about 70000 entities and add 100000 entities. My question is now: what is the best way of deleting those entities?

Is there any way to avoid fetching the entity before deleting it? I was unable to find a way of doing something like:

DELETE from xxx WHERE foo1 IN ('bar1', 'bar2', 'bar3', ...)

I realize that App Engine offers an IN clause (albeit with a maximum length of 30, because of the limit on individual datastore queries per GQL query 1), but that still seems wasteful, because I would have to fetch the x entities and then delete them again (making two RPC calls per entity).

Note: the entity should be ignored if not found.

EDIT: Added info about problem

These entities are simply domains: the first string is the SLD and the second the TLD (no subdomains). The application can be used to perform a request like http://[...]/available/stackoverflow.com , and will return a True/False JSON object.

Why do I have so many entities? Because the datastore contains all registered domains (.com for now). I cannot perform a whois request in every case because of TOSs and latency. So I initially populate the datastore with an entire zone file and then daily add/remove the domains that have been registered/dropped... The problem is that these are pretty big quantities and I have to figure out a way to keep costs down while adding/removing 2*~100000 domains per day.
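For what it's worth, the daily add/remove sets can be derived by diffing two consecutive zone-file snapshots. A minimal sketch (`daily_delta` is a name I've made up for illustration):

```python
def daily_delta(old_zone, new_zone):
    """Given two iterables of domain names (yesterday's and today's
    zone file), return (to_add, to_remove) as sets."""
    old, new = set(old_zone), set(new_zone)
    return new - old, old - new
```

The removals then only need keys, not fetched entities, which matters for the delete question above.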

Note: there is hardly any computation going on as an availability request simply checks whether the domain exists in the datastore!

1: ' A maximum of 30 datastore queries are allowed for any single GQL query.' (http://code.google.com/appengine/docs/python/datastore/gqlreference.html)


If you are not doing so already, you should be using key_names for this.

You'll want a model something like:

class UnavailableDomain(db.Model):
    pass

Then you will populate your datastore like:

UnavailableDomain.get_or_insert(key_name='stackoverflow.com')
UnavailableDomain.get_or_insert(key_name='google.com')

Then you will query for available domains with something like:

is_available = UnavailableDomain.get_by_key_name('stackoverflow.com') is None

Then when you need to remove a bunch of domains because they have become available, you can build a big list of keys without having to query the database first like:

free_domains = ['stackoverflow.com', 'monkey.com']
db.delete(db.Key.from_path('UnavailableDomain', name) for name in free_domains)

I would still recommend batching the deletes into groups of around 200 per RPC if your free_domains list is really big.
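A small chunking helper covers the batching (a sketch; `chunk` is a hypothetical name, and the commented usage assumes the `db.delete` / `db.Key.from_path` calls shown above):

```python
def chunk(items, size):
    """Yield successive slices of at most `size` items from a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# usage with the GAE db API from above:
# for batch in chunk(free_domains, 200):
#     db.delete(db.Key.from_path('UnavailableDomain', name) for name in batch)
```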


Have you considered the appengine-mapreduce library? It comes with the pipeline library, and you could use both to:

  • Create a pipeline for the overall task that you will run via cron every 24hrs
  • The 'overall' pipeline would start a mapper that filters your entities and yields the delete operations
  • After the delete mapper completes, the 'overall' pipeline could call an 'import' pipeline to start running your entity creation part.
  • The pipeline API can then send you an email reporting its status.