Why does BigQuery fail to parse an Avro file that is accepted by avro-tools?


I'm trying to export Google Cloud Datastore data to Avro files in Google Cloud Storage and then load those files into BigQuery.
First, I know that BigQuery can load Datastore backups directly. That approach has several disadvantages that I'd like to avoid:
The backup tool is closed source.
The backup tool's format is undocumented.
The backup format cannot be read directly by Dataflow.
Backup scheduling for App Engine is in (apparently perpetual) alpha.
It is possible to implement your own backup handler in App Engine, but it is fire-and-forget: you won't know exactly when the backup has finished or what the file names will be.
With the motivation for this experiment clarified, here is my Dataflow pipeline to export the data to Avro format:
package com.example.dataflow;

import com.google.api.services.datastore.DatastoreV1;
import com.google.api.services.datastore.DatastoreV1.Entity;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.coders.AvroCoder;
import com.google.cloud.dataflow.sdk.io.AvroIO;
import com.google.cloud.dataflow.sdk.io.DatastoreIO;
import com.google.cloud.dataflow.sdk.io.Read;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.file.SeekableByteArrayInput;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.protobuf.ProtobufData;
import org.apache.avro.protobuf.ProtobufDatumWriter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.ByteArrayOutputStream;

public class GCDSEntitiesToAvroSSCCEPipeline {

    private static final String GCS_TARGET_URI = "gs://myBucket/datastore/dummy";
    private static final String ENTITY_KIND = "Dummy";

    // Derive the Avro schema for the Datastore Entity protobuf via avro-protobuf.
    private static Schema getSchema() {
        return ProtobufData.get().getSchema(Entity.class);
    }

    private static final Logger LOG = LoggerFactory.getLogger(GCDSEntitiesToAvroSSCCEPipeline.class);

    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline p = Pipeline.create(options);

        DatastoreV1.Query.Builder q = DatastoreV1.Query.newBuilder()
                .addKind(DatastoreV1.KindExpression.newBuilder().setName(ENTITY_KIND));

        p.apply(Read.named("DatastoreQuery").from(DatastoreIO.source()
                        .withDataset(options.as(DataflowPipelineOptions.class).getProject())
                        .withQuery(q.build())))
                .apply(ParDo.named("ProtoBufToAvro").of(new ProtoBufToAvro()))
                .setCoder(AvroCoder.of(getSchema()))
                .apply(AvroIO.Write.named("WriteToAvro")
                        .to(GCS_TARGET_URI)
                        .withSchema(getSchema())
                        .withSuffix(".avro"));

        p.run();
    }

    // Converts each Entity protobuf to an Avro GenericRecord by round-tripping
    // it through an in-memory Avro container file.
    private static class ProtoBufToAvro extends DoFn<Entity, GenericRecord> {
        private static final long serialVersionUID = 1L;

        @Override
        public void processElement(ProcessContext c) throws Exception {
            Schema schema = getSchema();

            // Write the single protobuf entity as Avro into a byte buffer...
            ProtobufDatumWriter<Entity> pbWriter = new ProtobufDatumWriter<>(Entity.class);
            DataFileWriter<Entity> dataFileWriter = new DataFileWriter<>(pbWriter);
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            dataFileWriter.create(schema, bos);
            dataFileWriter.append(c.element());
            dataFileWriter.close();

            // ... then read it back as a GenericRecord for AvroIO to write out.
            DatumReader<GenericRecord> datumReader = new GenericDatumReader<>(schema);
            DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(
                    new SeekableByteArrayInput(bos.toByteArray()), datumReader);
            c.output(dataFileReader.next());
        }
    }
}
The pipeline runs fine; however, when I try to load the resultant Avro file into BigQuery, I get the following error:
bq load --project_id=roodev001 --source_format=AVRO dummy.dummy_1 gs://roodev001.appspot.com/datastore/dummy-00000-of-00001.avro
Waiting on bqjob_r5c9b81a49572a53b_00000154951eb523_1 ... (0s) Current status: DONE
BigQuery error in load operation: Error processing job 'roodev001:bqjob_r5c9b81a49572a53b_00000154951eb523_1': The Apache Avro library failed to parse file
gs://roodev001.appspot.com/datastore/dummy-00000-of-00001.avro.
However, if I read the resultant Avro file with avro-tools, everything is fine:
avro-tools tojson datastore-dummy-00000-of-00001.avro | head
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
{"key":{"com.google.api.services.datastore.DatastoreV1$.Key":{"partition_id":{"com.google.api.services.datastore.DatastoreV1$.PartitionId":{"dataset_id":"s~roodev001","namespace":""}},"path_element":[{"kind":"Dummy","id":4503905778008064,"name":""}]}},"property":[{"name":"number","value":{"boolean_value":false,"integer_value":879,"double_value":0.0,"timestamp_microseconds_value":0,"key_value":null,"blob_key_value":"","string_value":"","blob_value":"","entity_value":null,"list_value":[],"meaning":0,"indexed":true}}]}
...
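For what it's worth, the writer schema embedded in the file header can be dumped with avro-tools as well (same file as above):
avro-tools getschema datastore-dummy-00000-of-00001.avro
As the tojson output already hints, every record name sits under the namespace com.google.api.services.datastore.DatastoreV1$ (note the trailing "$").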
I used this code to populate Datastore with dummy data before running the Dataflow pipeline:
package com.example.datastore;

import com.google.gcloud.AuthCredentials;
import com.google.gcloud.datastore.*;

import java.io.IOException;

// Note: the enclosing class was missing from the original snippet; the name
// here is arbitrary.
public class DummyDataPopulator {

    public static void main(String[] args) throws IOException {
        Datastore datastore = DatastoreOptions.builder()
                .projectId("myProjectId")
                .authCredentials(AuthCredentials.createApplicationDefaults())
                .build().service();

        KeyFactory dummyKeyFactory = datastore.newKeyFactory().kind("Dummy");

        Batch batch = datastore.newBatch();
        int batchCount = 0;
        for (int i = 0; i < 4000; i++) {
            IncompleteKey key = dummyKeyFactory.newKey();
            System.out.println("adding entity " + i);
            batch.add(Entity.builder(key).set("number", i).build());
            batchCount++;
            // Submit in batches of 100 to stay well under Datastore's
            // per-commit mutation limit. 4000 divides evenly by 100, so no
            // trailing partial batch is left unsubmitted here.
            if (batchCount > 99) {
                batch.submit();
                batch = datastore.newBatch();
                batchCount = 0;
            }
        }
        System.out.println("done");
    }
}
So why is BigQuery rejecting my Avro files?
BigQuery uses the C++ Avro library, and apparently it doesn't like the "$" in the namespace. Here's the error message:
Invalid namespace: com.google.api.services.datastore.DatastoreV1$
We're working on getting these Avro error messages out to the end user.
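For anyone who wants to confirm this without running a BigQuery load job, here is a minimal sketch (the class name InspectEntitySchema is mine, not part of the original pipeline) that prints the namespace straight from the avro-protobuf reflection data:

package com.example.dataflow;

import com.google.api.services.datastore.DatastoreV1.Entity;
import org.apache.avro.Schema;
import org.apache.avro.protobuf.ProtobufData;

// Hypothetical helper, not part of the original pipeline: prints the schema
// that avro-protobuf derives for the Datastore Entity protobuf. Because
// Entity is a nested Java class, the derived namespace ends in "DatastoreV1$".
public class InspectEntitySchema {
    public static void main(String[] args) {
        Schema schema = ProtobufData.get().getSchema(Entity.class);
        System.out.println("namespace: " + schema.getNamespace());
        System.out.println(schema.toString(true)); // pretty-printed schema JSON
    }
}

The Java implementation evidently tolerates the "$" (both the pipeline above and avro-tools handle the file without complaint), while BigQuery's C++ parser enforces the Avro spec's rule that name parts contain only letters, digits, and underscores.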
