Skip to main content
Version: Next

Dataset format: Multi-modal causal language modeling

Dataset format

The data for a multi-modal causal language modeling experiment needs to be in a ZIP file (1) containing a JSON file (2), a JSONL file (3) and an image folder (4).

folder_name.zip (1)
│ └───json_name.json (2)
│ └───jsonl_name.jsonl (3)
│ │
│ └───image_folder_name (4)
│ └───image_name1.image_extension1
│ └───image_name2.image_extension2
│ ...
Note

You can include multiple JSON and JSONL files within the ZIP archive, which can serve as your train, validation, and test datasets.

  • The train JSON and JSONL files need to follow the format described below.
  • The validation JSON and JSONL files need to follow the same format as the train JSON and JSONL files.
  • The test JSON and JSONL files need to follow the same format as the train JSON and JSONL files, but they do not require labels.
  1. The dataset connectors require the data for a multi-modal causal language modeling experiment to be provided in a ZIP file.

    Note

    To learn how to upload your ZIP file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A JSON file containing meta data of a dataset. This file includes the following fields:

    • root: Specifies the path to the image folder.
      Note
      • Image names are provided in the JSONL file. Ensure that the combination of root and image names forms the full directory path to the images.
    • annotation: Specifies the full directory path of the corresponding JSONL file.
    • data_augment: It should be set to false.
    • repeat time: Defaults to 1.
      Note

      If repeat time is less than 1, for example, 0.5, only the first half of the dataset will be used. If repeat time is greater than 1, for example, 2, the dataset will be duplicated accordingly.

    • length: This field specifies the total number of samples in the dataset.


    {
    "dataset_name": {
    "root": "directory_to_image_folder/",
    "annotation": "directory_to_jsonl_file/jsonl_name.jsonl",
    "data_augment": false,
    "repeat_time": 1,
    "length": 2500
    }
    }

  3. A JSONL file containing the actual dataset, where each line represents a single data sample in the form of a dictionary with the following fields:

    • id: This field specifies the unique identifier of the sample.

    • image: This field specifies the name of the image associated with the sample.

    • conversations: This field specifies a list of two dictionaries, each representing a specific role and its corresponding value:

      • The first dictionary includes:
        • 'from': 'human': Indicates the role of the human.
        • 'value': '<image>\n some text': A query or prompt, where <image> is a placeholder for image information followed by accompanying text.
      • The second dictionary includes:
        • 'from': 'gpt': Indicates the role of the model.
        • 'value': 'some text': The response or label generated by the model.
      Note
      • The two roles, "human" and "gpt", are fixed.
      • The <image> token serves as a special placeholder indicating where image information should be integrated within the text. For optimal performance, include this token at the beginning of the text.

    {'id': '11', 'image': '495.jpg', 'conversations': [{'from': 'human', 'value': '<image>\nwhat is written in the image?'}, {'from': 'gpt', 'value': 'THE'}]}
    {'id': '22', 'image': '496.jpg', 'conversations': [{'from': 'human', 'value': '<image>\nwhat is written in the image?'}, {'from': 'gpt', 'value': 'PRO'}]}
    ...

  4. An image folder containing all the images referenced in the JSONL files.

    Note

    All image file names need to specify an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.

Example

The sroie_multimodal_causal_language_modeling.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a multi-modal causal language modeling problem. The structure of the ZIP file is as follows:

sroie_multimodal_causal_language_modeling.zip
│ └───train.json
│ └───train.jsonl
│ │
│ └───valid.json
│ └───valid.jsonl
│ │
│ └───img
│ └───X00016469612.jpg
│ └───X00016469619.jpg
│ └───X00016469620.jpg
│ ...

The content of the train.json file is as follows:

{
"SROIE_train": {
"root": "./",
"annotation": "train.jsonl",
"data_augment": false,
"repeat_time": 1,
"length": 500
}
}

The first two lines of the train.jsonl file are as follows:

{"id": 0, "image": "img/X00016469612.jpg", "conversations": [{"from": "human", "value": "<image>\n\nExtract the following information from receipt document.\n### Understanding the receipt Layout and output fields:\n1. Company: This is the name of the business or organization associated with the document. It typically represents the entity that issued the invoice or receipt.\n2. Date: This is the date when the document was issued or the transaction occurred. It helps in tracking and organizing documents chronologically.\n3. Address: This refers to the physical location of the company or the place where the transaction took place. It usually includes street address, city, and postal code.\n4. Total: This represents the total amount or cost indicated in the document. It is usually found at the end of invoices or receipts and is important for financial records.\n### Provide your answers in the following format:\n{\n    \"company\": \"\",\n    \"date\": \"\",\n    \"address\": \"\",\n    \"total\": \"\"\n    }\n"}, {"from": "gpt", "value": "{\n    \"company\": \"BOOK TA .K (TAMAN DAYA) SDN BHD\",\n    \"date\": \"25/12/2018\",\n    \"address\": \"NO.53 55,57 & 59, JALAN SAGU 18, TAMAN DAYA, 81100 JOHOR BAHRU, JOHOR.\",\n    \"total\": \"9.00\"\n}"}]}
{"id": 1, "image": "img/X00016469619.jpg", "conversations": [{"from": "human", "value": "<image>\n\nExtract the following information from receipt document.\n### Understanding the receipt Layout and output fields:\n1. Company: This is the name of the business or organization associated with the document. It typically represents the entity that issued the invoice or receipt.\n2. Date: This is the date when the document was issued or the transaction occurred. It helps in tracking and organizing documents chronologically.\n3. Address: This refers to the physical location of the company or the place where the transaction took place. It usually includes street address, city, and postal code.\n4. Total: This represents the total amount or cost indicated in the document. It is usually found at the end of invoices or receipts and is important for financial records.\n### Provide your answers in the following format:\n{\n \"company\": \"\",\n \"date\": \"\",\n \"address\": \"\",\n \"total\": \"\"\n }\n"}, {"from": "gpt", "value": "{\n \"company\": \"INDAH GIFT & HOME DECO\",\n \"date\": \"19/10/2018\",\n \"address\": \"27, JALAN DEDAP 13, TAMAN JOHOR JAYA, 81100 JOHOR BAHRU, JOHOR.\",\n \"total\": \"60.30\"\n}"}]}
...

The content of the valid.json file is as follows:

{
"SROIE_valid": {
"root": "./",
"annotation": "valid.jsonl",
"data_augment": false,
"repeat_time": 1,
"length": 126
}
}

The first two lines of the valid.jsonl file are as follows:

{"id": 0, "image": "img/X51007339162.jpg", "conversations": [{"from": "human", "value": "<image>\n\nExtract the following information from receipt document.\n### Understanding the receipt Layout and output fields:\n1. Company: This is the name of the business or organization associated with the document. It typically represents the entity that issued the invoice or receipt.\n2. Date: This is the date when the document was issued or the transaction occurred. It helps in tracking and organizing documents chronologically.\n3. Address: This refers to the physical location of the company or the place where the transaction took place. It usually includes street address, city, and postal code.\n4. Total: This represents the total amount or cost indicated in the document. It is usually found at the end of invoices or receipts and is important for financial records.\n### Provide your answers in the following format:\n{\n    \"company\": \"\",\n    \"date\": \"\",\n    \"address\": \"\",\n    \"total\": \"\"\n    }\n"}, {"from": "gpt", "value": "{\n    \"company\": \"SANYU STATIONERY SHOP\",\n    \"date\": \"02/12/2017\",\n    \"address\": \"NO. 31G&33G, JALAN SETIA INDAH X ,U13/X 40170 SETIA ALAM\",\n    \"total\": \"8.70\"\n}"}]}
{"id": 1, "image": "img/X51007339164.jpg", "conversations": [{"from": "human", "value": "<image>\n\nExtract the following information from receipt document.\n### Understanding the receipt Layout and output fields:\n1. Company: This is the name of the business or organization associated with the document. It typically represents the entity that issued the invoice or receipt.\n2. Date: This is the date when the document was issued or the transaction occurred. It helps in tracking and organizing documents chronologically.\n3. Address: This refers to the physical location of the company or the place where the transaction took place. It usually includes street address, city, and postal code.\n4. Total: This represents the total amount or cost indicated in the document. It is usually found at the end of invoices or receipts and is important for financial records.\n### Provide your answers in the following format:\n{\n \"company\": \"\",\n \"date\": \"\",\n \"address\": \"\",\n \"total\": \"\"\n }\n"}, {"from": "gpt", "value": "{\n \"company\": \"SANYU STATIONERY SHOP\",\n \"date\": \"26/11/2017\",\n \"address\": \"NO. 31G&33G, JALAN SETIA INDAH X ,U13/X 40170 SETIA ALAM\",\n \"total\": \"9.30\"\n}"}]}
...
Note

To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Demo (preprocessed) datasets.


Feedback