A generative AI beginner tried RAG with the Amazon Bedrock Knowledge base

This article is the 12/7 entry in the TechHarmony Advent Calendar.

Hello. This is Ishihara from SCSK.

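At AWS re:Invent 2023, it was announced that Knowledge base and Agents for Amazon Bedrock had reached GA. In this article I use Knowledge base to try out RAG (Retrieval Augmented Generation).

RAG retrieves information from a data store and uses it to augment the responses generated by a large language model (LLM). This makes it possible to build a feature that answers questions from internal documents and other information that is not publicly available.

Determined to ride this big wave, I tried the service as a generative AI beginner and wrote up the results in this article.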

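Overview

The goal this time is to use the RetrieveAndGenerate API to search internal documents and then have an LLM generate a text answer from the search results. In terms of AWS services, the steps are as follows.

  • Store the internal documents in Amazon S3.
  • Use Knowledge base to embed the information to be searched and store it in a vector database (OpenSearch).
  • Run something that can call the RetrieveAndGenerate API (the AWS CLI in this article).

Everything was run in N. Virginia (us-east-1), where the feature was available at the time of writing (2023/12/04).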

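Setup

Store the internal documents in Amazon S3

Because Amazon Bedrock is configured in N. Virginia (us-east-1), I created the Amazon S3 bucket in the same region.

Since this is only a trial, I stored documentation for the Informatica products I use in my daily work as the internal documents. The documentation can be exported as PDF, so I exported it and saved it to the S3 bucket.

The documents I stored were obtained from this site.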
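If you prefer a script to the console for this step, the upload could be sketched roughly as below. This is only an illustration under my own assumptions: the helper names `plan_uploads` and `upload_all` and the local directory are hypothetical, and the actual upload needs boto3 and valid credentials, so only the planning helper runs without AWS access.

```python
from pathlib import Path


def plan_uploads(directory):
    """Return (local_path, s3_key) pairs for every PDF directly under `directory`."""
    pdfs = sorted(Path(directory).glob("*.pdf"))
    # The object key is simply the file name, so the PDFs land at the bucket root.
    return [(str(pdf), pdf.name) for pdf in pdfs]


def upload_all(directory, bucket, region="us-east-1"):
    """Upload the planned PDFs to the bucket (requires boto3 and credentials)."""
    import boto3  # imported lazily so plan_uploads works without boto3 installed

    s3 = boto3.client("s3", region_name=region)
    for local_path, key in plan_uploads(directory):
        s3.upload_file(local_path, bucket, key)
        print(f"uploaded {local_path} to s3://{bucket}/{key}")
```

For the bucket in this article the call would look like `upload_all("./exported-pdfs", "ishihara-bedrock-knowledge")`, where `./exported-pdfs` is a placeholder for wherever the exported PDFs were saved.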

Embed the documents stored in S3 with Knowledge base

The knowledge base can be created with the following steps.

I used the following parameters this time.

Knowledge base name: knowledge-base-techharmony-01
IAM permissions: Create and use a new service role
Data source name: knowledge-base-techharmony-01-data-source
S3 URI: s3://ishihara-bedrock-knowledge
Embeddings model: Titan Embeddings G1 Text v1.2
Vector database: Quick create a new vector store

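Creation finishes in a few minutes.

Once creation is complete, open the Data source tab, select the data source for the bucket specified earlier, and click "Sync" to synchronize it.

I had 7 PDF files totaling about 17.5 MB, and the sync finished in about one minute.

Once the sync is complete you can run a test, so I tried an arbitrary search. The results show that the search does return something.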
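I ran the sync from the console, but the same step can probably be scripted. The sketch below assumes the boto3 `bedrock-agent` client and its `start_ingestion_job` / `get_ingestion_job` calls; the function name `sync_data_source`, the polling interval, and the set of terminal status names are my own assumptions, and the data source ID is not shown in this article, so it has to be looked up separately.

```python
import time

# Status names assumed to mean the ingestion job has stopped changing.
TERMINAL_STATUSES = {"COMPLETE", "FAILED", "STOPPED"}


def sync_data_source(client, knowledge_base_id, data_source_id,
                     poll_seconds=10, max_polls=60, sleep=time.sleep):
    """Start an ingestion job and poll it until it reaches a terminal status."""
    job = client.start_ingestion_job(
        knowledgeBaseId=knowledge_base_id,
        dataSourceId=data_source_id,
    )["ingestionJob"]
    for _ in range(max_polls):
        if job["status"] in TERMINAL_STATUSES:
            return job["status"]
        sleep(poll_seconds)
        job = client.get_ingestion_job(
            knowledgeBaseId=knowledge_base_id,
            dataSourceId=data_source_id,
            ingestionJobId=job["ingestionJobId"],
        )["ingestionJob"]
    raise TimeoutError("ingestion job did not finish within the polling limit")
```

With real credentials, `client` would be `boto3.client("bedrock-agent", region_name="us-east-1")`. Passing the client in as an argument keeps the polling logic easy to try out with a stand-in object.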

 

Call the RetrieveAndGenerate API (AWS CLI)

As a first try, I started CloudShell from the AWS console and casually ran help to look up the AWS CLI command.

[cloudshell-user@ip-10-132-39-170 ~]$ aws bedrock-agent-runtime help

usage: aws [options] <command> <subcommand> [<subcommand> ...] [parameters]
To see help text, you can run:

  aws help
  aws <command> help
  aws <command> <subcommand> help

aws: error: argument command: Invalid choice, valid choices are:
... (omitted)
Invalid choice: 'bedrock-agent-runtime', maybe you meant:

  * bedrock-runtime

[cloudshell-user@ip-10-132-39-170 ~]$ 

The command does not exist?!

That is what you get with a brand-new feature. Don't panic; just update the AWS CLI.

[cloudshell-user@ip-10-132-39-170 ~]$ aws --version
aws-cli/2.13.34 Python/3.11.6 Linux/6.1.59-84.139.amzn2023.x86_64 exec-env/CloudShell exe/x86_64.amzn.2 prompt/off
[cloudshell-user@ip-10-132-39-170 tmp]$ curl -s "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
[cloudshell-user@ip-10-132-39-170 tmp]$ ls
awscliv2.zip  tmux-1000  v8-compile-cache-0
[cloudshell-user@ip-10-132-39-170 tmp]$ unzip -q awscliv2.zip
[cloudshell-user@ip-10-132-39-170 tmp]$ sudo ./aws/install --update
You can now run: /usr/local/bin/aws --version
[cloudshell-user@ip-10-132-39-170 tmp]$ aws --version
aws-cli/2.14.5 Python/3.11.6 Linux/6.1.59-84.139.amzn2023.x86_64 exec-env/CloudShell exe/x86_64.amzn.2 prompt/off
[cloudshell-user@ip-10-132-39-170 tmp]$ 

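The AWS CLI was updated from 2.13.34 to 2.14.5.

First, I ran the "bedrock-agent-runtime retrieve" command to perform only the search, the same operation I had tested on the Knowledge base screen.

As before, plausible results are returned from the PDFs stored in the S3 bucket.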
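To avoid hitting the same surprise in a script, the `aws --version` output can be checked before calling the new command. This is a minimal sketch: the helper names are hypothetical, and the threshold is simply the version that worked in the session above (2.14.5). I did not check which release first added `bedrock-agent-runtime`, so an earlier 2.14.x build may also work.

```python
import re

# The only facts taken from the session above:
# 2.13.34 lacked the command and 2.14.5 had it.
KNOWN_GOOD_VERSION = (2, 14, 5)


def parse_cli_version(version_line):
    """Extract (major, minor, patch) from the first token of `aws --version`."""
    match = re.match(r"aws-cli/(\d+)\.(\d+)\.(\d+)", version_line)
    if match is None:
        raise ValueError(f"unexpected version string: {version_line!r}")
    return tuple(int(part) for part in match.groups())


def is_known_good(version_line):
    """True if the CLI is at least the version confirmed to work in this article."""
    return parse_cli_version(version_line) >= KNOWN_GOOD_VERSION


print(is_known_good("aws-cli/2.13.34 Python/3.11.6"))  # False
print(is_known_good("aws-cli/2.14.5 Python/3.11.6"))   # True
```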

[cloudshell-user@ip-10-132-39-170 tmp]$ aws bedrock-agent-runtime retrieve \
> --knowledge-base-id 5TSXYNE1QK \
> --retrieval-query text="What can you do with CDMP?"
{
    "retrievalResults": [
        {
            "content": {
                "text": ".  (2023 October)   CDMP-24657 When you try to remove an already existing value of a date type custom field of a data collection,  Data Marketplace displays an error even if you haven't configured the field as a mandatory field in  Metadata Command Center. (2023 October)   CDMP-24605 When you try to create a data collection, Data Marketplace displays an error even if your user  profile is assigned the Data Owner and Category Owner roles. (2023 September)   CDMP-24462 When you select an asset in the Data Assets grid on the Create Linked Data Assets wizard, the  grid reverts to its default state. (2023 September)   Known issues The following table describes the known issues in Data Marketplace:   CR Description   CDMP-24363 In Metadata Command Center, if you modify the acceptable values of a custom attribute of a Data  Marketplace order, the response of the Data Marketplace API that you use to retrieve the order  details displays the value for the custom attribute that was previously present in the same  position in the sequence of acceptable values. For example, for a custom field the acceptable values are A, B and C. In Metadata Command  Center, you replace the acceptable values with new values 1, 2 and 3."
            },
            "location": {
                "type": "S3",
                "s3Location": {
                    "uri": "s3://ishihara-bedrock-knowledge/DMP_2023November_ReleaseNotes_en.pdf"
                }
            },
            "score": 0.6106315
        },
        {
            "content": {
                "text": "Sort the search result  by various parameters.   To sort the search results, you can enter one of  the following values: - NAME - STATUS - ID - CREATED_BY - CREATED_ON - MODIFIED_BY - MODIFIED_ON Default value is MODIFIED_ON.   Delivery formats       31        Parameter Description Additional Information   sort Optional. Set the sorting order  of the search results.   Enter one of the following values: - To sort the search results by ascending order,    enter ASC. - To sort the search results by descending    order, enter DESC. Default value is DESC.   offset Optional. The starting index for  the paginated results.   Default value is 0.   limit Optional. The maximum  number of results.   Default value is 50.   Note: The API has no payload.   Example request The following example shows how you can use an API to retrieve the details of delivery format:   https://{{CDMP_URL}}/api/v1/integration/provisioning/deliveryFormats   Response When you pass the API query parameters in the REST client, the client displays a response for the parameter  values that you have entered."
            },
            "location": {
                "type": "S3",
                "s3Location": {
                    "uri": "s3://ishihara-bedrock-knowledge/DMP_2023November_(API)Reference_en.pdf"
                }
            },
            "score": 0.61188567
        },
        {
            "content": {
                "text": "Ensure that you don't use the prefix  value that is configured in Metadata  Command Center.   status Optional. The status of the consumer  access that you want to create.   Enter one of the following values: - For an active consumer access, enter  AVAILABLE.   - For a consumer access that is awaiting  withdrawal, enter PENDING_WITHDRAW.   - For a consumer access that is  withdrawn, enter WITHDRAWN.   Default value is AVAILABLE.   dataCollectionId Required. The system generated  unique identifier of the data collection  to which the Data User was granted  access.   For more information about how you can  use an API to get the unique identifier of a  data collection, see “Retrieve data  collections” on page 125. To get the unique identifier of a data  collection from the Data Marketplace  interface, open the data collection. The  data collection page's URL contains the  unique identifier. For example, in the URL https:// {{CDMP_URL}}/datacollection/ 25158afc-3dfb-44ef-8f3e- cec1e171d0f1?dtn=&tab=summary, the  unique identifier is  25158afc-3dfb-44ef-8f3e- cec1e171d0f1.   deliveryTargetId Required. The system generated  unique identifier of the delivery target  that was used to deliver the data."
            },
            "location": {
                "type": "S3",
                "s3Location": {
                    "uri": "s3://ishihara-bedrock-knowledge/DMP_2023November_(API)Reference_en.pdf"
                }
            },
            "score": 0.6162587
        },
        {
            "content": {
                "text": "The  value of the sessionId parameter in the response body is the value that you must use for the userSessionId  parameter. For more information about how you can retrieve the sessionId value, see the Login topic in the  REST API Reference help in Administrator. The Bearer is a JSON Web Token. For more information about how  you can retrieve the JSON Web Token, see the Generating and getting JWT tokens for managed APIs topic in  the API Portal Guide help in API Portal.   10       Chapter 1: Introduction    https://knowledge.informatica.com/s/article/CDMP-Rest-Accelerator-Pack      Start properties In Application Integration, define the binding type and access details for the system service action that you  want to create.   The following table describes how to configure the Start properties for a system service action to call a Data  Marketplace:   Property Description   Binding If you want to run the process by using a service URL, select REST/SOAP as the binding type   Allowed Groups Specify the user groups that should have access to the process service URL at run time.   Allowed Users Specify the users that should have access to the process service URL at run time.   Allow anonymous  access   Ensure that you do not select the Allow anonymous access property. If you select Allow anonymous access, you cannot call Data Marketplace APIs."
            },
            "location": {
                "type": "S3",
                "s3Location": {
                    "uri": "s3://ishihara-bedrock-knowledge/DMP_2023November_(API)Reference_en.pdf"
                }
            },
            "score": 0.61763495
        },
        {
            "content": {
                "text": "status=ACTIVE https://{{CDMP_URL}}/ is the base URL and /api/v1/integration/provisioning/deliveryTemplates  is the API endpoint.   How to call a Data Marketplace API       9    https://network.informatica.com/docs/DOC-19140      The base URL varies based on your region. The following table shows the regions and their  corresponding base URLs:   Region Base URL   United States of America https://cdgc-api.dm-us.informaticacloud.com/cdmp-marketplace/   United Kingdom https://cdgc-api.dm-uk.informaticacloud.com/cdmp-marketplace/   Canada https://cdgc-api.dm-na.informaticacloud.com/cdmp-marketplace/   Europe, Middle East, Africa (EMEA) https://cdgc-api.dm-em.informaticacloud.com/cdmp-marketplace/   Asia, Pacific https://cdgc-api.dm-ap.informaticacloud.com/cdmp-marketplace/   Japan https://cdgc-api.dm-apne.informaticacloud.com/cdmp-marketplace/   You can call Data Marketplace APIs only via Application Integration. You can call only 100 APIs per minute."
            },
            "location": {
                "type": "S3",
                "s3Location": {
                    "uri": "s3://ishihara-bedrock-knowledge/DMP_2023November_(API)Reference_en.pdf"
                }
            },
            "score": 0.64015627
        }
    ]
}
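The raw JSON is hard to scan, so here is a small sketch for condensing it. It assumes the response has been loaded into a Python dict (for example with `json.loads` on the CLI output) and that a higher `score` means a closer match, which I have not verified; the `descending` flag is there in case that assumption is wrong. The function name `rank_results` is hypothetical.

```python
def rank_results(response, limit=3, snippet_chars=80, descending=True):
    """Return (score, source URI, text snippet) tuples ordered by score."""
    results = sorted(
        response["retrievalResults"],
        key=lambda result: result["score"],
        reverse=descending,
    )
    return [
        (
            result["score"],
            result["location"]["s3Location"]["uri"],
            # Keep only the beginning of each chunk so the summary stays readable.
            result["content"]["text"][:snippet_chars].strip(),
        )
        for result in results[:limit]
    ]
```

Applied to the output above with `limit=1`, this would pick the chunk from the API reference PDF with score 0.64015627.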

 

Since I also want text to be generated this time, I ran the "bedrock-agent-runtime retrieve-and-generate" command.

I used claude-v2 as the model. It does return some kind of answer.

[cloudshell-user@ip-10-132-39-170 tmp]$ aws bedrock-agent-runtime retrieve-and-generate \
> --input text="What can you do with CDMP?" \
> --retrieve-and-generate-configuration type=KNOWLEDGE_BASE,knowledgeBaseConfiguration="{knowledgeBaseId=5TSXYNE1QK,modelArn=arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-v2}"
{
    "sessionId": "f4132659-dbef-4411-8b44-3b2dd49e3f1c",
    "output": {
        "text": "The Informatica Cloud Data Marketplace Platform (CDMP) allows you to access, publish and subscribe to data services. You can use CDMP APIs to programmatically manage data services, users, subscriptions, etc. Some key things you can do with CDMP APIs include:"
    },
    "citations": [
        {
            "generatedResponsePart": {
                "textResponsePart": {
                    "text": "The Informatica Cloud Data Marketplace Platform (CDMP) allows you to access, publish and subscribe to data services. You can use CDMP APIs to programmatically manage data services, users, subscriptions, etc. Some key things you can do with CDMP APIs include:",
                    "span": {
                        "start": 0,
                        "end": 257
                    }
                }
            },
            "retrievedReferences": [
                {
                    "content": {
                        "text": "sort Optional. Set the sorting order of the search  results.   Enter one of the following  values: - To sort the search results by    ascending order, enter ASC. - To sort the search results by    descending order, enter DESC. Default value is DESC.   offset Optional. The starting index for the paginated  results.   Default value is 0.   limit Optional. The maximum number of results. Default value is 50.   Note: The API has no payload.   Example request The following example shows how you can use an API to retrieve the details of delivery method:   https://{{CDMP_URL}}/api/v1/integration/provisioning/deliveryMethods? search=SK&ids=909668e3-f91d-4cde-a7d5- e3dca316ce97&status=ACTIVE&createdDateFrom=2022-01-12&createdDateTo=2022-12-12&modifiedDa teFrom=2022-01-12&modifiedDateTo=2022-12-12&sortByField=NAME&sort=DESC&offset=0&limit=2   Response When you pass the API query parameters in the REST client, the client displays a response for the parameter  values that you have entered."
                    },
                    "location": {
                        "type": "S3",
                        "s3Location": {
                            "uri": "s3://ishihara-bedrock-knowledge/DMP_2023November_(API)Reference_en.pdf"
                        }
                    }
                },
                {
                    "content": {
                        "text": "sort Optional. Set the sorting order of the  search results.   Enter one of the following values: - To sort the search results by    ascending order, enter ASC. - To sort the search results by    descending order, enter DESC. Default value is DESC.   offset Optional. The starting index for the  paginated results.   Default value is 0.   limit Optional. The maximum number of results. Default value is 50.   Note: The API has no payload.   Example request The following example shows how you can use an API to retrieve the details of a delivery target:   https://{{CDMP_URL}}/api/v1/integration/provisioning/deliveryTargets? search=AWS&status=ACTIVE&sortByField=MODIFIED_BY&sort=ASC   Response When you pass the API query parameters in the REST client, the client displays a response for the parameter  values that you have entered.   The following example shows the response of an API call to retrieve the details of a delivery target:   {   \"processingTime\": 4190,   \"offset\": 0,   \"limit\": 50,   \"totalCount\": 1,   Delivery targets       61          \"objects\": [     {       \"id\": \"16e32a13-6fba-4f1a-a337-e0aeecf9fab4"
                    },
                    "location": {
                        "type": "S3",
                        "s3Location": {
                            "uri": "s3://ishihara-bedrock-knowledge/DMP_2023November_(API)Reference_en.pdf"
                        }
                    }
                },
                {
                    "content": {
                        "text": "sort Optional. Set the sorting order of the search  results.   Enter one of the following  values: - To sort the search results by    ascending order, enter ASC. - To sort the search results by    descending order, enter  DESC.   Default value is DESC.   offset Optional. The starting index for the paginated  results.   Default value is 0.   limit Optional. The maximum number of results. Default value is 50.   Note: The API has no payload.   Example request The following example shows how you can use an API to retrieve cost centers:   https://{{CDMP_URL}}/api/v1/integration/costCenters?"
                    },
                    "location": {
                        "type": "S3",
                        "s3Location": {
                            "uri": "s3://ishihara-bedrock-knowledge/DMP_2023November_(API)Reference_en.pdf"
                        }
                    }
                },
                {
                    "content": {
                        "text": "Sort the search result  by various parameters.   To sort the search results, you can enter one of  the following values: - NAME - STATUS - ID - CREATED_BY - CREATED_ON - MODIFIED_BY - MODIFIED_ON Default value is MODIFIED_ON.   Delivery formats       31        Parameter Description Additional Information   sort Optional. Set the sorting order  of the search results.   Enter one of the following values: - To sort the search results by ascending order,    enter ASC. - To sort the search results by descending    order, enter DESC. Default value is DESC.   offset Optional. The starting index for  the paginated results.   Default value is 0.   limit Optional. The maximum  number of results.   Default value is 50.   Note: The API has no payload.   Example request The following example shows how you can use an API to retrieve the details of delivery format:   https://{{CDMP_URL}}/api/v1/integration/provisioning/deliveryFormats   Response When you pass the API query parameters in the REST client, the client displays a response for the parameter  values that you have entered."
                    },
                    "location": {
                        "type": "S3",
                        "s3Location": {
                            "uri": "s3://ishihara-bedrock-knowledge/DMP_2023November_(API)Reference_en.pdf"
                        }
                    }
                },
                {
                    "content": {
                        "text": "status=ACTIVE https://{{CDMP_URL}}/ is the base URL and /api/v1/integration/provisioning/deliveryTemplates  is the API endpoint.   How to call a Data Marketplace API       9    https://network.informatica.com/docs/DOC-19140      The base URL varies based on your region. The following table shows the regions and their  corresponding base URLs:   Region Base URL   United States of America https://cdgc-api.dm-us.informaticacloud.com/cdmp-marketplace/   United Kingdom https://cdgc-api.dm-uk.informaticacloud.com/cdmp-marketplace/   Canada https://cdgc-api.dm-na.informaticacloud.com/cdmp-marketplace/   Europe, Middle East, Africa (EMEA) https://cdgc-api.dm-em.informaticacloud.com/cdmp-marketplace/   Asia, Pacific https://cdgc-api.dm-ap.informaticacloud.com/cdmp-marketplace/   Japan https://cdgc-api.dm-apne.informaticacloud.com/cdmp-marketplace/   You can call Data Marketplace APIs only via Application Integration. You can call only 100 APIs per minute."
                    },
                    "location": {
                        "type": "S3",
                        "s3Location": {
                            "uri": "s3://ishihara-bedrock-knowledge/DMP_2023November_(API)Reference_en.pdf"
                        }
                    }
                }
            ]
        }
    ]
}
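To see which part of the answer is backed by which document, the `citations` array can be condensed as in the sketch below. It assumes the response is already a Python dict and treats `span.end` as an inclusive index; that is only my reading of the output above (the span ends at 257 for a text of 258 characters), not something I confirmed in the API documentation. The function name `summarize_citations` is hypothetical.

```python
def summarize_citations(response):
    """Return (cited text, sorted unique source URIs) for each citation."""
    output_text = response["output"]["text"]
    summary = []
    for citation in response["citations"]:
        span = citation["generatedResponsePart"]["textResponsePart"]["span"]
        # Assumption: `end` is inclusive, so add 1 for Python slicing.
        cited_text = output_text[span["start"]:span["end"] + 1]
        uris = sorted(
            {ref["location"]["s3Location"]["uri"] for ref in citation["retrievedReferences"]}
        )
        summary.append((cited_text, uris))
    return summary
```

For the response above this would yield a single entry whose source list contains only the API reference PDF, since all five retrieved references point to the same file.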

 

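I also tried asking a question in Japanese.

To the question "CDMPとは何ですか。特徴を3つ回答してください。" ("What is CDMP? Please give three of its features."), the answer below came back. The content sounds plausible, although I suspect it is slightly wrong.

Translated into English, the answer reads: "CDMP is an abbreviation for Informatica's Data Marketplace. Its features include: 1. a marketplace where you can search for and obtain data sources, 2. integration functions that can integrate and transform data, 3. extensive security functions."

Looking at "retrievedReferences" in the output, I could also confirm that the answer refers to the PDF file stored in Amazon S3.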
[cloudshell-user@ip-10-132-39-170 tmp]$ aws bedrock-agent-runtime retrieve-and-generate \
> --input text="CDMPとは何ですか。特徴を3つ回答してください。" \
> --retrieve-and-generate-configuration type=KNOWLEDGE_BASE,knowledgeBaseConfiguration="{knowledgeBaseId=5TSXYNE1QK,modelArn=arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-v2}"
{
    "sessionId": "03a426cf-b8bc-4b44-9cb3-a0be05b200fb",
    "output": {
        "text": "CDMPはInformaticaのData Marketplaceの略称です。 特徴としては、\n1. データソースを検索・取得できるマーケットプレイス\n2. データを統合・変換できるインテグレーション機能\n3. セキュリティ機能が充実\nが挙げられます。"
    },
    "citations": [
        {
            "generatedResponsePart": {
                "textResponsePart": {
                    "text": "CDMPはInformaticaのData Marketplaceの略称です。",
                    "span": {
                        "start": 0,
                        "end": 38
                    }
                }
            },
            "retrievedReferences": [
                {
                    "content": {
                        "text": "status=ACTIVE https://{{CDMP_URL}}/ is the base URL and /api/v1/integration/provisioning/deliveryTemplates  is the API endpoint.   How to call a Data Marketplace API       9    https://network.informatica.com/docs/DOC-19140      The base URL varies based on your region. The following table shows the regions and their  corresponding base URLs:   Region Base URL   United States of America https://cdgc-api.dm-us.informaticacloud.com/cdmp-marketplace/   United Kingdom https://cdgc-api.dm-uk.informaticacloud.com/cdmp-marketplace/   Canada https://cdgc-api.dm-na.informaticacloud.com/cdmp-marketplace/   Europe, Middle East, Africa (EMEA) https://cdgc-api.dm-em.informaticacloud.com/cdmp-marketplace/   Asia, Pacific https://cdgc-api.dm-ap.informaticacloud.com/cdmp-marketplace/   Japan https://cdgc-api.dm-apne.informaticacloud.com/cdmp-marketplace/   You can call Data Marketplace APIs only via Application Integration. You can call only 100 APIs per minute."
                    },
                    "location": {
                        "type": "S3",
                        "s3Location": {
                            "uri": "s3://ishihara-bedrock-knowledge/DMP_2023November_(API)Reference_en.pdf"
                        }
                    }
                }
            ]
        },
        {
            "generatedResponsePart": {
                "textResponsePart": {
                    "text": "特徴としては、\n1. データソースを検索・取得できるマーケットプレイス\n2. データを統合・変換できるインテグレーション機能\n3. セキュリティ機能が充実\nが挙げられます。",
                    "span": {
                        "start": 40,
                        "end": 125
                    }
                }
            },
            "retrievedReferences": [
                {
                    "content": {
                        "text": "status=ACTIVE https://{{CDMP_URL}}/ is the base URL and /api/v1/integration/provisioning/deliveryTemplates  is the API endpoint.   How to call a Data Marketplace API       9    https://network.informatica.com/docs/DOC-19140      The base URL varies based on your region. The following table shows the regions and their  corresponding base URLs:   Region Base URL   United States of America https://cdgc-api.dm-us.informaticacloud.com/cdmp-marketplace/   United Kingdom https://cdgc-api.dm-uk.informaticacloud.com/cdmp-marketplace/   Canada https://cdgc-api.dm-na.informaticacloud.com/cdmp-marketplace/   Europe, Middle East, Africa (EMEA) https://cdgc-api.dm-em.informaticacloud.com/cdmp-marketplace/   Asia, Pacific https://cdgc-api.dm-ap.informaticacloud.com/cdmp-marketplace/   Japan https://cdgc-api.dm-apne.informaticacloud.com/cdmp-marketplace/   You can call Data Marketplace APIs only via Application Integration. You can call only 100 APIs per minute."
                    },
                    "location": {
                        "type": "S3",
                        "s3Location": {
                            "uri": "s3://ishihara-bedrock-knowledge/DMP_2023November_(API)Reference_en.pdf"
                        }
                    }
                }
            ]
        }
    ]
}
Note that "anthropic.claude-v2" must be enabled beforehand. I first tried "amazon.titan-text-express-v1", but an error was returned saying that the model is not supported.
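Because the model is just part of the ARN, switching models comes down to building a different request. The following is a rough sketch of the same call through boto3, assuming the `bedrock-agent-runtime` client and its `retrieve_and_generate` method; `build_request` and `ask` are hypothetical helper names, and the default model is the one that worked in this article.

```python
def build_request(question, knowledge_base_id,
                  model_id="anthropic.claude-v2", region="us-east-1"):
    """Build the keyword arguments for a RetrieveAndGenerate call."""
    model_arn = f"arn:aws:bedrock:{region}::foundation-model/{model_id}"
    return {
        "input": {"text": question},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": knowledge_base_id,
                "modelArn": model_arn,
            },
        },
    }


def ask(question, knowledge_base_id,
        model_id="anthropic.claude-v2", region="us-east-1"):
    """Send the question and return only the generated text (needs boto3 and credentials)."""
    import boto3  # imported lazily so build_request works without boto3 installed

    client = boto3.client("bedrock-agent-runtime", region_name=region)
    request = build_request(question, knowledge_base_id, model_id, region)
    return client.retrieve_and_generate(**request)["output"]["text"]
```

Calling `ask("What can you do with CDMP?", "5TSXYNE1QK")` should correspond to the first CLI command in this section, provided access to the model has been enabled in the account.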

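Conclusion

This was my first time using Amazon Bedrock, but I was able to try it out in a few hours.

Just calling the API returns results that are easy to understand at a glance, which was a lot of fun. I was also surprised that it gave plausible answers even though I had loaded only a few PDF files.

Finally, don't forget to delete your resources!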
