Trying out the AWS Data Lake Solution (a predecessor of Lake Formation?)

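Lake Formation was announced at re:Invent 2018 last month. I heard the announcement on site and applied for the preview right away, but I still don't have access.

Around that time, someone I work with told me that this exists: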

aws.amazon.com

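The architecture shown on that page wraps a variety of management functions for an S3-based data lake.

After deploying it and logging on to the SPA, I got a screen that looks a lot like the AWS console. Thinking this might be close to what Lake Formation itself will be, I tried it out and am writing up my notes here.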


Setup

Setup just means deploying a CloudFormation template.

Go to this page:

Automated Deployment - Data Lake Solution

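and click the Launch Solution button.

Fill in the required parameters and run it. Set options and IAM settings as needed, but for a quick trial, leaving everything as-is with CAPABILITY_IAM is enough.

The CloudFormation stack is deployed to the N. Virginia region (us-east-1). You can also deploy to Tokyo (ap-northeast-1), but in that case you build it yourself from the GitHub repository, following its build procedure.

The CloudFormation stacks are nested, and it took more than 20 minutes for everything to deploy.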

When setup finishes, Cognito sends an email like the following to the address you specified. Open the https://xxxxxxx.cloudfront.net address it contains.

Item	Value
From	no-reply@verificationemail.com
Subject	Your Data Lake account.
Body	You are invited to join the Data Lake. Your Data Lake username is yusuke_otomo_beex-inc_com and temporary password is xxxxxxxxx. Please sign in to the Data Lake with your email address and your temporary password at https://xxxxxxxxxxx.cloudfront.net .

A logon screen appears. Log in with your email address and the temporary password from the invitation email.

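Concepts

The Data Lake Solution is now deployed, but before using it, let's check the concepts.

According to the passage below, the key concept is a unit called a package. A package groups multiple datasets together with their associated metadata, permissions, and so on.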

Data lake solution concepts

The central concept of this data lake solution is a package. This is a container in which you can store one or more files. You can also tag the package with metadata so you can easily find it again. For example, the data you need to store may come from a vast network of weather stations. Perhaps each station sends several files containing sensor readings every 5 minutes. In this case, you would build a package each time a weather station sends data. The package would contain all the sensor reading files and would be tagged with metadata, such as, the location of each station with the date and time on which the readings were taken. You can configure the data lake solution to require that all packages have certain tags. This helps ensure you maintain visibility on the data added to your lake.

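Trying it out

Let's try it.

Authentication

Starting at the entrance, with authentication.

Cognito is the default for authentication. Users and groups can be managed from the web UI, so you can operate it without knowing Cognito. This Cognito authentication also applies to Elasticsearch and Kibana.

In the web app, permissions are controlled by the combination of role (administrator or member) and group.

I haven't tried it, but federated authentication with ADFS also seems possible. In that case you apparently deploy with this template: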

https://github.com/awslabs/aws-data-lake-solution/blob/fa800dd1b2339184377742a6dc96b1e53f47b6ff/deployment/data-lake-deploy-federated.template

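Registering data

Register data in the data lake.

Creating a package

First, create a package.

Visibility is something like the package's permission setting. It controls whether the package shows up in search results.

There are required tags and optional tags. They are configured under Governance in the application's global settings.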

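Adding data

From the web UI, add data on the Content tab.

When you press the Save button, the data itself is stored in S3, and the metadata and related information are stored in DynamoDB and Elasticsearch.

By default, the S3 path is s3://data-lake-us-east-1-accountid/packageid/unixtime/filename.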
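As a small illustration of that path layout, here is a sketch that splits an object key into its parts. It assumes the unixtime segment is in milliseconds, which I am inferring from the 13-digit values in the sample manifests later in this post. The names parse_dataset_key and DatasetLocation are my own, not part of the solution.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class DatasetLocation(NamedTuple):
    """Parts of a key shaped like '<packageid>/<unixtime>/<filename>' (hypothetical helper type)."""
    package_id: str
    uploaded_at: datetime
    filename: str


def parse_dataset_key(key: str) -> DatasetLocation:
    """Split an object key into package ID, upload time, and file name."""
    # Split on the first two slashes only, so a file name containing '/' survives.
    package_id, unixtime, filename = key.split("/", 2)
    # Assumption: the timestamp is milliseconds since the Unix epoch.
    uploaded_at = datetime.fromtimestamp(int(unixtime) / 1000, tz=timezone.utc)
    return DatasetLocation(package_id, uploaded_at, filename)


if __name__ == "__main__":
    print(parse_dataset_key("mNOk0PHrL/1544760749933/aws-vpa-tweets-sample.gz"))
```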

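At the same time, a crawler starts running on the Integration tab. This shows the status of a Glue crawler. When it completes, a table is created in Glue, and the View Data link takes you to Athena.

Files can also be added using a manifest file written in Glue's pattern format.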

{
    "dataStore": [
        {
            "includePath": "s3://<sample-bucket-name>/nyc-taxi-tlc/yellow/",
            "excludePatterns": ["2010/*", "2011/*", "2012/*", "2013/*", "2014/*"]
        },
        {
            "includePath": "s3://<sample-bucket-name>/nyc-taxi-tlc/green/",
            "excludePatterns": ["2013/*", "2014/*"]
        },
        {
            "includePath": "s3://<sample-bucket-name>/nyc-taxi-tlc/fhv/"
        }
    ]
}
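To get a feel for what such a manifest selects, here is a rough sketch that checks whether an S3 URI would be picked up. It is only my approximation of Glue-style glob matching (I assume `*` and `?` stay within one folder level and `**` crosses levels), not the solution's or Glue's actual implementation. The bucket name example-bucket and the function names are hypothetical.

```python
import json
import re

# A cut-down manifest in the same shape as the one above (bucket name is made up).
MANIFEST = json.loads("""
{
    "dataStore": [
        {
            "includePath": "s3://example-bucket/nyc-taxi-tlc/yellow/",
            "excludePatterns": ["2010/*", "2011/*"]
        },
        {
            "includePath": "s3://example-bucket/nyc-taxi-tlc/fhv/"
        }
    ]
}
""")


def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a small glob subset: '**' crosses '/', while '*' and '?' do not."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def is_included(manifest: dict, s3_uri: str) -> bool:
    """True if s3_uri is under an includePath and matches none of its excludePatterns."""
    for store in manifest["dataStore"]:
        prefix = store["includePath"]
        if not s3_uri.startswith(prefix):
            continue
        # Exclude patterns are evaluated relative to the include path.
        relative = s3_uri[len(prefix):]
        patterns = store.get("excludePatterns", [])
        return not any(glob_to_regex(p).fullmatch(relative) for p in patterns)
    return False
```

Under these assumptions, a pattern like "2010/*" only excludes files directly under 2010/. Excluding deeper levels would need something like "2010/**".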

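Using data

Next, the flow from the consumer's side.

Search

You can run a full-text search over metadata such as tags and package descriptions. Packages that match the search and that you have permission to see are displayed.

Column names from the schema obtained by the Glue crawler are recorded in Elasticsearch in the same way, so they are searchable too.

Adding to the cart

Add the datasets in a package to the cart.

Generating a manifest

Generate a manifest from the packages added to the cart.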

There are two types of manifest:

Amazon S3 Signed URLs
Amazon S3 Bucket/Keys

Amazon S3 Signed URLs

Choosing Amazon S3 Signed URLs gives you a list of pre-signed URLs. You can use the URLs as-is to access the objects without authenticating.

{
    "entries": [
        {
            "url": "https://data-lake-us-east-1-123456789012.s3.amazonaws.com/mNOk0PHrL/1544760749933/aws-vpa-tweets-sample.gz?X-Amz-Algorithm=(omitted)"
        },
        {
            "url": "https://data-lake-us-east-1-123456789012.s3.amazonaws.com/mNOk0PHrL/1544760749933/aws-vpa-tweets-sample2.gz?X-Amz-Algorithm=(omitted)"
        },
        {
            "url": "https://data-lake-us-east-1-123456789012.s3.amazonaws.com/mNOk0PHrL/1544760749933/aws-vpa-tweets-sample3.gz?X-Amz-Algorithm=(omitted)"
        }
    ]
}
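A manifest like this is easy to consume from a script. The following is a minimal sketch using only the Python standard library. The function names are my own, and I have not run download_all against a real manifest. Keep in mind that the URLs stop working once the manifest expiration configured under Settings has passed.

```python
import json
from pathlib import PurePosixPath
from typing import List, Tuple
from urllib.parse import urlsplit
from urllib.request import urlopen


def signed_url_targets(manifest_text: str) -> List[Tuple[str, str]]:
    """Return (url, local file name) pairs for each entry of a Signed URLs manifest."""
    manifest = json.loads(manifest_text)
    targets = []
    for entry in manifest["entries"]:
        url = entry["url"]
        # The file name is the last path segment; the query string holds the signature.
        filename = PurePosixPath(urlsplit(url).path).name
        targets.append((url, filename))
    return targets


def download_all(manifest_text: str) -> None:
    """Fetch every object into the current directory; no AWS credentials are required."""
    for url, filename in signed_url_targets(manifest_text):
        with urlopen(url) as response, open(filename, "wb") as out:
            out.write(response.read())
```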

Amazon S3 Bucket/Keys

Choosing Amazon S3 Bucket/Keys simply downloads a list of buckets and object keys.

{
    "entries": [
        {
            "bucket": "data-lake-us-east-1-123456789012",
            "key": "mNOk0PHrL/1544760749933/aws-vpa-tweets-sample.gz"
        },
        {
            "bucket": "data-lake-us-east-1-123456789012",
            "key": "mNOk0PHrL/1544757712107/aws-vpa-tweets-sample2.gz"
        },
        {
            "bucket": "data-lake-us-east-1-123456789012",
            "key": "mNOk0PHrL/1544757712107/aws-vpa-tweets-sample3.gz"
        }
    ]
}
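Since this manifest carries no signature, whoever reads the objects needs their own IAM permissions on the bucket. As a sketch, the entries can be turned into s3:// URIs that you could hand to a tool such as aws s3 cp. The function name to_s3_uris is hypothetical.

```python
import json
from typing import List


def to_s3_uris(manifest_text: str) -> List[str]:
    """Convert each bucket/key entry of a Bucket/Keys manifest into an s3:// URI."""
    manifest = json.loads(manifest_text)
    return [f"s3://{entry['bucket']}/{entry['key']}" for entry in manifest["entries"]]


if __name__ == "__main__":
    sample = """
    {
        "entries": [
            {
                "bucket": "data-lake-us-east-1-123456789012",
                "key": "mNOk0PHrL/1544760749933/aws-vpa-tweets-sample.gz"
            }
        ]
    }
    """
    for uri in to_s3_uris(sample):
        print(uri)
```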

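Settings

Settings -> General

What you can view

URL of this app
S3 bucket name
Amazon Elasticsearch Service information
- Elasticsearch URL
- Kibana URL
Cognito user pool ID

You can jump to Kibana and Elasticsearch from the links and authenticate with your Data Lake Solution user name and password.

What you can configure

Number of search results (items)
Manifest expiration (seconds)

Settings -> Governance

You can specify in advance the required tags that packages must have.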

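Other

Logs

Logs are collected in CloudWatch Logs, so tools such as CloudWatch Logs Insights make them easy to analyze.

API

Once a data lake grows large, manual work hits its limits, so you will probably want to do some of the work through the API.

This file shows which APIs exist. If you have already deployed with CloudFormation, you can also look at API Gateway directly.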

https://github.com/awslabs/aws-data-lake-solution/blob/fa800dd1b2339184377742a6dc96b1e53f47b6ff/deployment/data-lake-api.yaml

API access keys can be generated from the Profile screen.

Authentication with an access key is described here:

http://docs.awssolutionsbuilder.com/data-lake/api/working-with-api/

CLI

There is even a CLI. See the CLI documentation for how to install it.

$ datalake help
Usage: datalake [options] [command]

Options:
  -V, --version                           output the version number
  -h, --help                              output usage information

Commands:
  add-cart-item [parameters]              adds a package to the user's cart
  checkout-cart [parameters]              checks out a user's cart to generate manifest files for pending cart items
  create-group [parameters]               Creates a new group in the data lake Amazon Cognito user pool.
  create-package [parameters]             creates a new data lake package
  create-package-metadata [parameters]    creates a new data lake package
  delete-group [parameters]               Deletes the specified group from the data lake Amazon Cognito user pool. Currently only groups with no members can be deleted.
  describe-cart [parameters]              describes a user's cart
  describe-cart-item [parameters]         describes a item in the user's cart
  describe-package [parameters]           describes the details of a package
  describe-package-dataset [parameters]   describes a dataset associated to a package
  describe-package-datasets [parameters]  describes the datasets associated with a package
  describe-package-metadata [parameters]  describes the metadata associated with a package
  describe-required-metadata              list the required metadata for packages
  execute-package-crawler [parameters]    Starts a crawler for the specified package, regardless of what is scheduled. If the crawler is already running, the request is ignored.
  get-group [parameters]                  Retrieves a group from the data lake Amazon Cognito user pool.
  get-package-crawler [parameters]        Retrieves crawler metadata for a specified package.
  get-package-table-data [parameters]     Retrieves the external link to view table data in Amazon Athena.
  get-user-group-list [parameters]        Lists the groups that the user belongs to.
  import-package-manifest [parameters]    uploads a new import manifest file for a package
  list-groups [parameters]                Retrieves data lake groups from Amazon Cognito group pool.
  list-package-tables [parameters]        Retrieves the definitions of some or all of the tables in a given package.
  remove-cart-item [parameters]           removes a package from the user's cart
  remove-package [parameters]             removes a package from the data lake
  remove-package-dataset [parameters]     removes a dataset from a package
  remove-user-from-group [parameters]     Remove the specified user from the specified group.
  search [parameters]                     search data lake
  update-group [parameters]               Updates the specified group with the specified attributes.
  update-package [parameters]             overwrites the details for a package
  update-package-crawler [parameters]     Update the package crawler. If the package does not have one, a new crawler is created.
  update-user-group-list [parameters]     Updates the list of groups that the user belongs to.
  upload-package-dataset [parameters]     uploads a new dataset file for a package
  help [cmd]                              display help for [cmd]

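Wrapping up

I was a little surprised when a cart showed up, but after using it I found that the whole flow works smoothly, from provisioning users through registering data to consuming it.

Whether it can be used as-is probably depends on your scale and requirements, but it gave me a lot of insights, so if you are interested in this area I recommend running it at least once.

If you do deploy it to try it out, don't forget to delete the CloudFormation stacks afterward.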
