The appcfg.py tool can upload data to your application's datastore. With just a little bit of set-up, you can create new datastore entities from CSV files. Most spreadsheet applications can export CSV files, making it easy for non-developers and other applications to produce data that can be imported into your application.
April 2009: Release 1.2.1 of the Python SDK includes an early version of a feature for downloading data, similar to the data upload feature. This feature is not yet available through appcfg.py, but if you'd like to try it, take a look at bulkloader.py. We will release this feature in full, including documentation, in a future version of the SDK.
The data loader feature communicates with your application running on App Engine using remote_api, a request handler included with the App Engine Python library that allows remote applications with the proper credentials to access the datastore remotely. To use remote_api, you must map a URL to it in your app.yaml file.
Edit your app.yaml and add the following lines to the handlers: section:

- url: /remote_api
  script: $PYTHON_LIB/google/appengine/ext/remote_api/handler.py
  login: admin
This maps the remote_api request handler to the URL /remote_api for your application. You can use any URL you like. Access to this URL is restricted to users who are administrators for the application.
Update your app to install the new app.yaml and the remote_api URL:

appcfg.py update
appcfg.py upload_data uses a file to keep track of its progress while uploading the data, in case the upload is interrupted and needs to be restarted from the middle of a data file. This feature requires the Sqlite 3 library to manage the progress file. If you do not have Sqlite 3 installed on your system, see the Sqlite web site for information on downloading and installing Sqlite 3.
Note: If you installed Python on your system by compiling it from source code, you will need to recompile Python with Sqlite support after installing Sqlite.
You can test if Sqlite is installed for Python by entering the following command:
python -c 'import sqlite3'
If the command does nothing, then Sqlite 3 is installed. If Sqlite 3 is not installed, it prints: ImportError: No module named sqlite3
If you do not wish to install Sqlite 3, you can disable the use of a progress file. To do so, give appcfg.py upload_data this argument: --db_filename=skip
If you disable the progress file and the upload is interrupted, you will have to edit your CSV file to prevent duplicate entities from being created when you restart the upload.
To tell appcfg.py upload_data how to translate each row in the CSV file to a datastore entity, you define loader classes in a Python source file. The file imports or defines the Model classes for the entities being created, and defines a loader class for each kind you wish to import.
For example, say you have a Model class named Album defined in a file named models.py (which is in your PYTHONPATH, such as the directory where you'll run the tool) that resembles the following:
from google.appengine.ext import db

class Album(db.Model):
    artist = db.StringProperty()
    title = db.StringProperty()
    publication_date = db.DateProperty()
    length_in_minutes = db.IntegerProperty()
You wish to import a CSV file that has the columns in the following order: title, artist, publication date, and length in minutes. The CSV file contains string representations for each of these values.
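For instance, a CSV file with that column order might look like the following (the titles, artists, and values here are invented purely for illustration):

```csv
Songs About Code,The Example Band,06/15/2004,42
First Steps,An Imaginary Artist,01/02/1999,51
```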
The following is a loader class file for this data:
import datetime
from google.appengine.ext import db
from google.appengine.tools import bulkloader
import models

class AlbumLoader(bulkloader.Loader):
    def __init__(self):
        bulkloader.Loader.__init__(self, 'Album',
                                   [('title', str),
                                    ('artist', str),
                                    ('publication_date',
                                     lambda x: datetime.datetime.strptime(x, '%m/%d/%Y').date()),
                                    ('length_in_minutes', int)
                                    ])

loaders = [AlbumLoader]
The bulk loader tool looks for a global variable in the loader class file named loaders, whose value is a list of loader classes to use. In this example, the tool loads the AlbumLoader class for loading entities of the kind Album.
The loader class defines an __init__() method that calls the __init__() method of the Loader base class. The first argument is self, the loader class instance. The second argument is the name of the datastore kind as a string, in this case 'Album'. The third argument is a sequence of tuples, where each tuple contains the name of the property (a string) and a conversion function. The conversion function must take a string and return one of the datastore value types.
In this example, the 'title' and 'artist' properties both take string values, so the conversion function is str, the string constructor. The 'length_in_minutes' property takes an integer, and the int constructor takes a string and converts it to an integer value.
For the 'publication_date' property, the model needs a date value. In this case, we know (from our data file) that publication dates are represented in the form mm/dd/yyyy. The conversion function is a Python lambda expression (a short function) that takes a string, passes it to datetime.datetime.strptime() with a pattern to parse the value to a datetime.datetime, then calls its date() method to get the final datetime.date value.
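As a minimal sketch of this conversion (using only the Python standard library; the function name and sample date are made up for illustration), the same parse could be written as a named function:

```python
import datetime

def parse_publication_date(value):
    # Same conversion the lambda in the loader performs: parse an
    # mm/dd/yyyy string, then keep only the date portion.
    return datetime.datetime.strptime(value, '%m/%d/%Y').date()

print(parse_publication_date('06/15/2004'))  # prints 2004-06-15
```

A named function like this can be easier to test and reuse than an inline lambda, and either form is accepted as the conversion function in the loader tuple.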
If any conversion function raises an exception or fails to return a value that meets the requirements of the Model class, the tool stops processing the data file and reports an error. If you are using a progress file (with Sqlite, as described above), you can fix the CSV file (or the code, as needed), then re-run the upload to continue from the row where the error occurred.
If the name of the file containing the loader class definition is album_loader.py, you would give appcfg.py upload_data the following argument: --config_file=album_loader.py
appcfg.py upload_data accepts data in the form of a file in the CSV (Comma Separated Values) format: a simple text file representing a table of values, with one row per line and columns delimited by commas (,). If a value contains one or more commas, the value is surrounded by double quotes ("). If a value contains a double quote, the double quote appears twice (""). The tool uses the csv module from the Python standard library to parse the data file.
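To see these quoting rules in action, here is a small sketch using the same csv module the tool relies on (the header and row data are invented for illustration):

```python
import csv
import io

# One header row, and one data row whose values contain a comma and
# escaped (doubled) double quotes, per the CSV quoting rules above.
data = 'title,artist\n"Hello, World","The ""Quoted"" Band"\n'
rows = list(csv.reader(io.StringIO(data)))
print(rows[1])  # ['Hello, World', 'The "Quoted" Band']
```

Most spreadsheet applications apply these same quoting rules automatically when exporting, so you rarely need to quote values by hand.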
The file must use the UTF-8 encoding for the text data. UTF-8 encoding is compatible with ASCII encoding, so any spreadsheet application that saves CSV as ASCII will also work.
Most spreadsheet applications can export a sheet as a CSV file. To export a sheet as CSV from Google Spreadsheets, select the File menu > Export > .csv Sheet Only, then save the file from the browser window that opens.
If the name of the CSV file to load is album_data.csv, you would give appcfg.py upload_data the following argument: --filename=album_data.csv
If the CSV file's first row is not data (such as if it's a header row), use the following option to skip the first row: --has_header
To start the data upload, run appcfg.py upload_data with the appropriate arguments:
appcfg.py upload_data --config_file=album_loader.py --filename=album_data.csv --kind=Album <app-directory>
If you are using a Google Apps domain name and need appcfg.py to sign in using an account on that domain, you must specify the --auth_domain=... option, whose value is your domain name.
If you'd like to test how your data works with the app before uploading it, you can load it into the development server. Use the --url option to point the tool at the development server URL. For example:
appcfg.py upload_data --config_file=album_loader.py --filename=album_data.csv --kind=Album --url=http://localhost:8080/remote_api <app-directory>
The appcfg.py upload_data command accepts the following arguments. See also the other options appcfg.py accepts for all actions, listed in Uploading and Managing a Python App: Command-Line Arguments.
appcfg.py upload_data [options] <app-directory>

--filename=...
Required. The path to the CSV data file to load.
--kind=...
Required. The name of the datastore kind to use for creating new entities.
--config_file=...
Required. A Python source file that imports or defines Model classes for the kinds of entities that an upload might create, as well as Loader classes for each kind. appcfg.py upload_data provides the Loader base class in the local namespace when evaluating this file.
--loader_opts=...
An option to pass to the Loader class's initialize()
method. You can implement this method to pass arguments to your Loader classes.
--log_file=...
The name of the file to write logging information about the upload. The default is to create a file named bulkloader-log-timestamp in the current working directory (where timestamp is the time the tool is run).
--auth_domain=...
The name of the authorization domain of the account to use to contact remote_api. If you're using a Google Apps domain and need appcfg.py to sign in using a Google Apps account, specify your domain name with this option.
--num_threads=#
The number of threads to spawn to upload new entities in parallel. The default is 10.
--batch_size=#
The number of entities to create with each remote_api call. For large entities, use a small batch size to limit the amount of data posted per batch. The default is 10.
--bandwidth_limit=#
The maximum total number of bytes per second that all threads should send. Bursts may exceed this maximum, but the overall bandwidth will be less than this amount. The default is 250,000 bytes per second.
--rps_limit=#
The maximum total number of records per second that all threads should send. The default is 20.
--http_limit=#
The maximum total number of HTTP requests per second that all threads should send. The default is 7.5 per second (15 per 2 seconds).
--db_filename=...
The filename to use for the progress file for this run. If not specified, this is named bulkloader-progress-timestamp, where timestamp represents the time the command is run. If this argument is specified with the value skip, the upload will not use a progress file.
--has_header
If present, skips the first row in the CSV file, assuming it's a header row.
--app_id=...
The application ID, if different from the application ID specified in the app's app.yaml
file.
--url=...
The URL of the remote_api handler to use to connect to the datastore. By default, this is derived from the app's app.yaml
file.