
Uploading Data

The appcfg.py tool can upload and download data to and from your application's datastore. With just a little bit of setup, you can upload new datastore entities from CSV files, and download entity data into CSV files. Most spreadsheet applications can export CSV files, making it easy for non-developers and other applications to produce data that can be imported into your application. You can customize the upload and download logic to use different kinds of files, or do other data processing.

Setting Up remote_api

The data loader feature communicates with your application running on App Engine using remote_api, a request handler included with the App Engine Python library that allows remote applications with the proper credentials to access the datastore remotely. To use remote_api, you must map a URL to it in your app.yaml file.

Edit your app.yaml, and add the following lines to the handlers: section:

- url: /remote_api
  script: $PYTHON_LIB/google/appengine/ext/remote_api/handler.py
  login: admin

This maps the remote_api request handler to the URL /remote_api for your application. You can use any URL you like. Access to this URL is restricted to users who are administrators of the application.

Update your app to install the new app.yaml and the remote_api URL:

appcfg.py update

Installing Sqlite (recommended)

appcfg.py uses a data file to keep track of its progress while uploading or downloading data, in case the process is interrupted and needs to be resumed. This feature requires the Sqlite 3 library to manage the progress file. If you do not have Sqlite 3 installed on your system, see the Sqlite web site for information on downloading and installing Sqlite 3.

Note: If you installed Python on your system by compiling it from source code, you will need to recompile Python with Sqlite support after installing Sqlite.

You can test if Sqlite is installed for Python by entering the following command:

python -c 'import sqlite3'

If the command prints nothing, Sqlite 3 is installed. If Sqlite 3 is not installed, the command prints: ImportError: No module named sqlite3

If you do not wish to install Sqlite 3, you can disable the use of a progress file by giving appcfg.py the argument --db_filename=skip. If you disable the progress file and an upload or download is interrupted, the process must be restarted from the beginning.

Creating Loader Classes

To upload data, you must tell appcfg.py how to translate each row in the data file to a datastore entity. You do this using a file of Python code. The file imports or defines the Model classes for the entities being created, defines a loader class for each kind you wish to import, and declares the available loader classes in a global variable.

For example, say you have a Model class named Album defined in a file named models.py (which is in your PYTHONPATH, such as the directory where you'll run the tool) that resembles the following:

from google.appengine.ext import db

class Album(db.Model):
  artist = db.StringProperty()
  title = db.StringProperty()
  publication_date = db.DateProperty()
  length_in_minutes = db.IntegerProperty()

You wish to import a CSV file that has the columns in the following order: title, artist, publication date, and length in minutes. The CSV file contains string representations for each of these values.
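For illustration, a data file in this column order might look like the following (the album rows are made-up examples; note that a value containing a comma is surrounded by double quotes):

```
Example Album,Example Artist,05/27/2008,42
"Songs, Vol. 2","The Example, Band",01/15/2007,38
```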

The following is a loader class file for this data:

import datetime
from google.appengine.ext import db
from google.appengine.tools import bulkloader
import models

class AlbumLoader(bulkloader.Loader):
  def __init__(self):
    bulkloader.Loader.__init__(self, 'Album',
                               [('title', str),
                                ('artist', str),
                                ('publication_date',
                                 lambda x: datetime.datetime.strptime(x, '%m/%d/%Y').date()),
                                ('length_in_minutes', int)
                               ])

loaders = [AlbumLoader]

The bulk loader tool looks for a global variable in the loader class file named loaders, whose value is a list of loader classes to use. In this example, the tool loads the AlbumLoader class for loading entities of the kind Album.

The loader class defines an __init__() method that calls the __init__() on the Loader class. The first argument is self, the loader class instance. The second argument is the name of the datastore kind as a string, in this case 'Album'. The third argument is a sequence of tuples, where each tuple contains the name of the property (a string) and a conversion function. The conversion function must take a string and return one of the datastore value types.

In this example, the 'title' and 'artist' properties both take string values, so the conversion function is str, the string constructor. The 'length_in_minutes' property takes an integer, and the int constructor takes a string and converts it to an integer value.

For the 'publication_date' property, the model needs a datetime value. In this case, we know (from our data file) that publication dates are represented in the form mm/dd/yyyy. The conversion function is a Python lambda expression (a short function) that takes a string, then passes it to datetime.datetime.strptime() with a pattern to parse the value to a datetime.datetime, then calls its date() method to get the final datetime.date value.
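You can try out a conversion function like this one in a regular Python interpreter before running the tool. The following sketch uses the same lambda expression as the loader above, applied to a made-up date string:

```python
import datetime

# The same conversion used for 'publication_date' in the loader above.
convert_date = lambda x: datetime.datetime.strptime(x, '%m/%d/%Y').date()

# Parses an mm/dd/yyyy string into a datetime.date value.
print(convert_date('05/27/2008'))  # prints 2008-05-27
```

If a conversion function grows beyond a one-liner, you can define it as a regular named function and refer to it by name in the tuple instead of using a lambda.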

If any conversion function raises an exception or fails to return a value that meets the requirements of the Model class, processing of the data file stops and reports an error. If you are using a progress file (with Sqlite, as described above), you can fix the CSV file (or the code, as needed), then re-run the upload to continue starting from the row where the error occurred.

If the name of the file containing the loader class definition is album_loader.py, you would give appcfg.py upload_data the following argument: --config_file=album_loader.py

Preparing Your Data

appcfg.py upload_data accepts data in the form of a file in the CSV (Comma Separated Values) format, a simple text file representing a table of values, with one row per line and columns delimited by commas (,). If a value contains one or more commas, the value is surrounded by double quotes ("). If a value contains a double quote, the double quote appears twice (""). The tool uses the csv module from the Python standard library to parse the data file.

The file must use the UTF-8 encoding for the text data. UTF-8 encoding is compatible with ASCII encoding, so any spreadsheet application that saves CSV as ASCII will also work.
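The quoting rules above are applied automatically by the csv module, so you rarely need to produce them by hand. A quick illustration of how commas and embedded double quotes are escaped:

```python
import csv
import io

buf = io.StringIO()
# A value containing a comma is quoted; an embedded double quote is doubled.
csv.writer(buf).writerow(['Hello, World', 'a "quoted" word'])
print(buf.getvalue())  # "Hello, World","a ""quoted"" word"
```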

Most spreadsheet applications can export a sheet as a CSV file. To export a sheet as CSV from Google Spreadsheets, select the File menu > Export > .csv Sheet Only, then save the file from the browser window that opens.

If the name of the CSV file to load is album_data.csv, you would give appcfg.py upload_data the following argument: --filename=album_data.csv

If the CSV file's first row is not data (such as if it's a header row), use the following option to skip the first row: --has_header

Uploading the Data to App Engine

To start the data upload, run appcfg.py upload_data with the appropriate arguments:

appcfg.py upload_data --config_file=album_loader.py --filename=album_data.csv --kind=Album <app-directory>

If you are using a Google Apps domain name and need appcfg.py to sign in using an account on that domain, you must specify the --auth_domain=... option, whose value is your domain name.

If the transfer is interrupted, you can resume the transfer from where it left off using the --db_filename=... argument. The value is the name of the progress file created by the tool, which is either a name you provided with the --db_filename argument when you started the transfer, or a default name that includes a timestamp. This assumes you have sqlite3 installed, and did not disable the progress file with --db_filename=skip.

Loading Data Into the Development Server

If you'd like to test how your data works with the app before uploading it, you can load it into the development server. Use the --url option to point the tool at the development server URL. For example:

appcfg.py upload_data --config_file=album_loader.py --filename=album_data.csv --kind=Album --url=http://localhost:8080/remote_api <app-directory>

Creating Exporter Classes

To download data, you must tell appcfg.py how to export datastore entities to a file. As with loader classes for uploading, you do this with a file of Python code. The file imports or defines the Model classes, defines an exporter class for each kind you wish to export, and declares the available exporter classes in a global variable.

Continuing the example with the Album Model class defined above, the following file defines an exporter that produces a CSV file in the same format used by the loader class:

from google.appengine.ext import db
from google.appengine.tools import bulkloader
import models

class AlbumExporter(bulkloader.Exporter):
  def __init__(self):
    bulkloader.Exporter.__init__(self, 'Album',
                                 [('title', str, None),
                                  ('artist', str, None),
                                  ('publication_date', str, None),
                                  ('length_in_minutes', str, None)
                                 ])

exporters = [AlbumExporter]

The exporter class is similar to the loader class. The exporter class defines an __init__() method that calls the __init__() on the Exporter class. The first argument is self, the exporter instance. The second argument is the name of the datastore kind being exported, a string. The third argument is a sequence of tuples, one for each entity property being exported.

Each tuple has 3 elements: the property name, a function that takes the property value and converts it to a str, and a default value if the entity does not have the property set. If the default value is None, the exporter raises an exception if it attempts to export an entity that does not have the property.

The CSV file created by the exporter includes one row for each entity, and one column for each property mentioned in the sequence. The columns are in the order they appear in the sequence.
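One thing to watch for: the exporter above converts 'publication_date' with str, which produces the default datetime.date form (yyyy-mm-dd), not the mm/dd/yyyy form the example loader parses. If you want downloaded files to be re-uploadable as-is, you can supply a conversion function that restores the loader's format. A sketch, where format_date is a helper name of our own choosing:

```python
import datetime

# Hypothetical helper: format a date back into the mm/dd/yyyy form
# that the example AlbumLoader parses.
def format_date(d):
    return d.strftime('%m/%d/%Y')

# In the exporter's property list, the tuple would then become:
#   ('publication_date', format_date, None)
print(format_date(datetime.date(2008, 5, 27)))  # prints 05/27/2008
```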

Tip: You can define loader classes and exporter classes in the same file, and use the same arguments for appcfg.py upload_data and appcfg.py download_data to transfer data in both directions.

Downloading Data from App Engine

To start a data download, run appcfg.py download_data with the appropriate arguments:

appcfg.py download_data --config_file=album_loader.py --filename=album_data_archive.csv --kind=Album <app-directory>

If you are using a Google Apps domain name and need appcfg.py to sign in using an account on that domain, you must specify the --auth_domain=... option, whose value is your domain name.

If the transfer is interrupted, you can resume the transfer from where it left off using the --db_filename=... and --result_db_filename=... arguments. These arguments are the names of the progress file and the results file created by the tool, which are either names you provided with the arguments when you started the transfer, or default names that include a timestamp. This assumes you have sqlite3 installed, and did not disable progress files with --db_filename=skip.

Command Line Arguments

The appcfg.py upload_data command accepts the following arguments. See also the other options appcfg.py accepts for all actions, listed in Uploading and Managing a Python App: Command-Line Arguments.

appcfg.py upload_data [options] <app-directory>
--filename=...

Required. The path to the CSV data file to load.

--kind=...

Required. The name of the datastore kind to use for creating new entities.

--config_file=...

Required. A Python source file that imports or defines Model classes for the kinds of entities that an upload might create, as well as Loader classes for each kind. appcfg.py upload_data provides the Loader base class in the local namespace when evaluating this file.

--loader_opts=...

An option to pass to the Loader class's initialize() method. You can implement this method to pass arguments to your Loader classes.
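The --loader_opts value arrives as a single string, so your initialize() method must parse it itself. One common convention is a comma-separated list of key=value pairs; a sketch of a parser you might call from initialize() (the option names shown are made up):

```python
# Hypothetical sketch: parse a --loader_opts value such as
# "date_format=%m/%d/%Y,default_artist=Unknown" into a dict.
def parse_loader_opts(loader_opts):
    opts = {}
    for pair in loader_opts.split(','):
        key, _, value = pair.partition('=')
        opts[key] = value
    return opts

print(parse_loader_opts('date_format=%m/%d/%Y,default_artist=Unknown'))
```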

--log_file=...

The name of the file to which logging information about the upload is written. The default is to create a file named bulkloader-log-timestamp in the current working directory (where timestamp is the time the tool is run).

--auth_domain=...

The name of the authorization domain of the account to use to contact remote_api. If you're using a Google Apps domain and need appcfg.py to sign in using a Google Apps account, specify your domain name with this option.

--num_threads=#

The number of threads to spawn to upload new entities in parallel. The default is 10.

--batch_size=#

The number of entities to create with each remote_api call. For large entities, use a small batch size to limit the amount of data posted per batch. The default is 10.

--bandwidth_limit=#

The maximum total number of bytes per second that all threads should send. Bursts may exceed this maximum, but the overall bandwidth will be less than this amount. The default is 250,000 bytes per second.

--rps_limit=#

The maximum total number of records per second that all threads should send. The default is 20.

--http_limit=#

The maximum total number of HTTP requests per second that all threads should send. The default is 7.5 per second (15 per 2 seconds).

--db_filename=...

The filename to use for the progress file for this run. If not specified, this is named bulkloader-progress-timestamp, where timestamp represents the time the command is run. If this argument is specified with the value skip, the upload will not use a progress file.

--has_header

If present, skips the first row in the CSV file, assuming it's a header row.

--app_id=...

The application ID, if different from the application ID specified in the app's app.yaml file.

--url=...

The URL of the remote_api handler to use to connect to the datastore. By default, this is derived from the app's app.yaml file.

appcfg.py download_data [options] <app-directory>

The arguments for download_data are the same as those for upload_data. The --filename argument specifies the file to write to, instead of the file to read from. This file is overwritten when starting a new download, and appended to when continuing an interrupted download.

download_data also supports the following argument:

--result_db_filename=...

The filename to use for the results file for this run. The results file stores the exported data until the export is finished, so the final output file can be sorted by key. If not specified, this is named bulkloader-results-timestamp, where timestamp represents the time the command is run. If this argument is specified with the value skip, the download will not use a results file.