Writing files to S3 from a Glue DynamicFrame is faster with Parquet than with CSV or JSON

While developing with DynamicFrame, I had a job that did nothing heavy but took far longer than I expected. When I looked into it, the time was going into writing JSON.

As the title says, writing Parquet to S3 is much faster than writing JSON or CSV. The difference was large enough to affect the run time of the whole job.

Data used
What the job does
Results

Data used

The data is the set of ELB logs in Parquet format that AWS provides.

$ aws s3 ls s3://athena-examples-us-east-1/elb/parquet/year=2015/month=1/ --recursive --summarize --human-readable
#--- omitted
2017-02-16 09:43:27  370.4 MiB elb/parquet/year=2015/month=1/day=8/part-r-00157-e764ec48-9e47-4c2c-8d00-68b3a8534cc2.snappy.parquet
2017-02-16 09:43:34  359.4 MiB elb/parquet/year=2015/month=1/day=9/part-r-00013-e764ec48-9e47-4c2c-8d00-68b3a8534cc2.snappy.parquet

Total Objects: 31
   Total Size: 11.0 GiB

What the job does

Nothing, really. It reads the data above through the Data Catalog and writes it straight back to S3, keeping the partitions.
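If you want to time individual writes inside one script rather than compare whole job runs, a small context manager is one way to do it. This is a minimal sketch, not how the figures in this post were taken; `timed` and `timings` are hypothetical names and not part of Glue. Because Spark evaluates lazily, the time measured around a write will generally also cover reading the source data.

```python
import time
from contextlib import contextmanager

timings = {}  # label -> elapsed seconds


@contextmanager
def timed(label):
    """Record the wall-clock time of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


# Stand-in workload; in a Glue job the body would be the write call.
with timed("example"):
    sum(range(100_000))

print({label: f"{seconds:.3f}s" for label, seconds in timings.items()})
```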

Parquet -> JSON

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "elbsample", 
    table_name = "elb_parquet", 
    push_down_predicate = "year='2015' and month='01'", 
    transformation_ctx = "datasource0"
)

datasink1 = glueContext.write_dynamic_frame.from_options(
    frame = datasource0,
    connection_type = "s3", 
    connection_options = {
        "path": "s3://mybucket/elb/json",
        "partitionKeys": ['year','month','day']
    }, 
    format = "json", 
    transformation_ctx = "datasink1"
)

Parquet -> JSON(Gzip)

datasink1 = glueContext.write_dynamic_frame.from_options(
    frame = datasource0,
    connection_type = "s3", 
    connection_options = {
        "path": "s3://mybucket/elb/jsongzip",
        "compression": "gzip",
        "partitionKeys": ['year','month','day']
    }, 
    format = "json", 
    transformation_ctx = "datasink1"
)

Parquet -> CSV(Gzip)

datasink2 = glueContext.write_dynamic_frame.from_options(
    frame = datasource0,
    connection_type = "s3",
    connection_options = {
        "path": "s3://mybucket/elb/csvgzip",
        "compression": "gzip",
        "partitionKeys": ['year','month','day']
    },
    format = "csv",
    transformation_ctx = "datasink2"
)

Parquet -> Parquet

datasink1 = glueContext.write_dynamic_frame.from_options(
    frame = datasource0,
    connection_type = "s3", 
    connection_options = {
        "path": "s3://mybucket/elb/parquet",
        "compression": "gzip",
        "partitionKeys": ['year','month','day']
    }, 
    format = "parquet", 
    transformation_ctx = "datasink1"
)
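The four write calls above differ only in the output path, the format, and the compression setting. As a sketch of how they could be kept in one place, the helper below builds the keyword arguments for each variant. `build_sink_args` and `SINKS` are hypothetical names introduced here, not part of the Glue API, and `mybucket` is a placeholder bucket.

```python
# Hypothetical helper (not part of the Glue API): builds the keyword
# arguments that the write calls have in common.
def build_sink_args(path, fmt, compression=None,
                    partition_keys=("year", "month", "day"), ctx="datasink"):
    connection_options = {
        "path": path,
        "partitionKeys": list(partition_keys),
    }
    # Add the optional connection-level compression setting.
    if compression is not None:
        connection_options["compression"] = compression
    return {
        "connection_type": "s3",
        "connection_options": connection_options,
        "format": fmt,
        "transformation_ctx": ctx,
    }


# One entry per output variant; bucket and prefixes are placeholders.
SINKS = {
    "json": build_sink_args("s3://mybucket/elb/json", "json"),
    "jsongzip": build_sink_args("s3://mybucket/elb/jsongzip", "json", "gzip"),
    "csvgzip": build_sink_args("s3://mybucket/elb/csvgzip", "csv", "gzip"),
    "parquet": build_sink_args("s3://mybucket/elb/parquet", "parquet", "gzip"),
}
```

Inside a Glue job each variant would then be written with `glueContext.write_dynamic_frame.from_options(frame = datasource0, **SINKS["parquet"])`, so a copy-and-paste slip in one block cannot change the format of another.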

Other conversions

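For comparison I also converted the JSON(gzip) and CSV(gzip) output written above back into Parquet. Going by the results below, Parquet -> Parquet was the slowest of the three. The article below suggests the input format makes a difference in read time as well, and from these numbers Parquet looks like the slower one to read. My guess is that schema resolution is the reason. Even so, every write to Parquet was far faster than writing JSON or CSV.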

future-architect.github.io

Results

Conversion	Output size	Elapsed time
Parquet -> JSON(Gzip)	16.2GB (93 objects)	50 min
Parquet -> JSON	191.6GB (93 objects)	49 min
Parquet -> CSV(Gzip)	13.3GB (93 objects)	45 min
CSV(Gzip) -> JSON(Gzip)	16.2GB (93 objects)	53 min
Parquet -> Parquet	11.1GB (126 objects)	14 min
JSON(Gzip) -> Parquet	11.1GB (126 objects)	7 min
CSV(Gzip) -> Parquet	11.1GB (126 objects)	13 min
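To put the gap in numbers, the short script below divides each elapsed time by the Parquet -> Parquet time. The minutes are copied from the table above; `slowdown` is a name made up for this sketch. By this measure the JSON and CSV writes took roughly 3.2 to 3.8 times as long as Parquet -> Parquet.

```python
# Elapsed minutes copied from the results table.
elapsed_min = {
    "Parquet -> JSON(Gzip)": 50,
    "Parquet -> JSON": 49,
    "Parquet -> CSV(Gzip)": 45,
    "CSV(Gzip) -> JSON(Gzip)": 53,
    "Parquet -> Parquet": 14,
    "JSON(Gzip) -> Parquet": 7,
    "CSV(Gzip) -> Parquet": 13,
}

baseline = elapsed_min["Parquet -> Parquet"]


def slowdown(conversion):
    """How many times longer a conversion took than Parquet -> Parquet."""
    return elapsed_min[conversion] / baseline


# Print fastest first, with the ratio against the baseline.
for name in sorted(elapsed_min, key=elapsed_min.get):
    print(f"{name:<26} {elapsed_min[name]:>3} min  x{slowdown(name):.2f}")
```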