Version: v1.6.40-dev2 🚧

Import from URL

Overview

The Import from URL method imports documents directly from web URLs. It can import a single page or crawl linked pages on the same domain and extract their content, which makes it well suited to articles, documentation, and other online content.

When to use

  • Web content: When you need to import documents from websites
  • Online resources: For importing articles, documentation, or reports
  • Dynamic content: When content is regularly updated online
  • Public documents: For importing publicly available web documents
  • Research materials: When gathering content from multiple web sources

Configuration parameters

URL and crawling settings

| Option | Default | Description | Use case |
|---|---|---|---|
| URL | - (text input) | The web page URL to import from | Specify the starting point for content import |
| Follow links | Off (toggle) | Whether to crawl linked pages from the same domain | Enable for importing entire websites or documentation sections |
| Max documents | 1 (number input) | Maximum number of pages to import during crawling | 1: import only the specified page; 5-10: import a small section of related pages; 50+: import large documentation sites or blog series |
Note: For document processing options, see the Shared Document Processing Options section in the main documentation.
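
As a rough illustration of how the three settings relate, the sketch below models them as a small Python data class. The class name `UrlImportSettings` and its field names are hypothetical and are not part of the product; only the defaults (Follow links off, Max documents 1) come from the table above, and the validation rules are assumptions.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class UrlImportSettings:
    """Hypothetical container mirroring the URL and crawling settings."""

    url: str                    # starting point for the import
    follow_links: bool = False  # documented default: Off
    max_documents: int = 1      # documented default: 1

    def __post_init__(self):
        # Assumed validation: require an absolute http(s) URL.
        parsed = urlparse(self.url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an absolute web URL: {self.url!r}")
        # Assumed validation: at least one page must be importable.
        if self.max_documents < 1:
            raise ValueError("max_documents must be at least 1")


# Single page import: rely on the defaults.
single_page = UrlImportSettings(url="https://example.com/docs/intro")

# Small documentation section: follow links, cap at 10 pages.
docs_section = UrlImportSettings(
    url="https://example.com/docs",
    follow_links=True,
    max_documents=10,
)
```

Keeping the defaults identical to the form's defaults means that an unconfigured import stays limited to the one page you specified.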

Website crawling behavior

Single page import

  • Follow links: Off
  • Max documents: 1 (limits crawling to only the specified page)
  • Result: Only the specified page is imported
  • Use case: Specific articles, documentation pages, or reports

Multi-page crawling

  • Follow links: On
  • Max documents: Set to desired limit (controls how many pages are imported)
  • Result: Multiple pages from the same domain are imported, up to the specified limit
  • Use case: Entire documentation sites, blog series, or website sections
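
The difference between the two modes can be sketched with a small simulation. The function `plan_crawl` and the in-memory `site` link map are hypothetical stand-ins for fetching and parsing real pages, and the breadth-first visiting order is an assumption: the actual order in which the crawler visits pages is not documented here.

```python
from collections import deque


def plan_crawl(start, links, follow_links=False, max_documents=1):
    """Return the pages that would be imported, in breadth-first order.

    `links` maps each page URL to the URLs it links to.
    """
    imported = []
    seen = {start}
    queue = deque([start])
    while queue and len(imported) < max_documents:
        page = queue.popleft()
        imported.append(page)
        if not follow_links:
            break  # single page import: outgoing links are never queued
        for target in links.get(page, []):
            if target not in seen:  # do not queue a page twice
                seen.add(target)
                queue.append(target)
    return imported


# Hypothetical four-page site used only for this illustration.
site = {
    "https://example.com/docs": [
        "https://example.com/docs/a",
        "https://example.com/docs/b",
    ],
    "https://example.com/docs/a": [
        "https://example.com/docs/c",
        "https://example.com/docs",
    ],
    "https://example.com/docs/b": [],
    "https://example.com/docs/c": [],
}

single = plan_crawl("https://example.com/docs", site)
multi = plan_crawl(
    "https://example.com/docs", site, follow_links=True, max_documents=3
)
print(single)  # only the starting page
print(multi)   # starting page plus two linked pages, capped at 3
```

In this sketch the limit counts the starting page too, so a limit of 3 yields the starting page and two linked pages.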

Crawling rules

  • Same domain: Only pages on the same domain as the starting URL are crawled
  • Respect robots.txt: The crawler honors each website's robots.txt file
  • Rate limiting: Built-in delays between requests avoid overwhelming servers
  • Duplicate detection: Duplicate content is detected and skipped automatically

Beyond these built-in rules, keep the following in mind when importing:

  • Terms of service: Respect the website's terms of service
  • Copyright: Ensure you have permission to import and use the content
  • Request volume: Avoid overwhelming servers with too many requests
  • Data privacy: Be mindful of personal information that may be present on imported pages
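
For readers who want a concrete picture of the same-domain, robots.txt, and duplicate-detection rules, here is a minimal sketch using only the Python standard library. It is not the product's implementation: the helper names, the `example-importer` user agent, the exact-hostname comparison (which treats subdomains as different domains), and the hash-based notion of a duplicate are all assumptions. Rate limiting is left out for brevity.

```python
import hashlib
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def same_domain(start_url, candidate_url):
    """Assumed rule: a link qualifies only if its hostname matches exactly."""
    return urlparse(start_url).hostname == urlparse(candidate_url).hostname


def allowed_by_robots(robots_txt, url, user_agent="example-importer"):
    """Check a URL against robots.txt content that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def content_fingerprint(text):
    """Assumed duplicate test: hash of whitespace- and case-normalized text."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


robots = "User-agent: *\nDisallow: /private/\n"
start = "https://example.com/docs"

# Same-domain rule
on_site = same_domain(start, "https://example.com/docs/setup")
off_site = same_domain(start, "https://other.example.org/docs")

# robots.txt rule
public_ok = allowed_by_robots(robots, "https://example.com/docs/setup")
private_ok = allowed_by_robots(robots, "https://example.com/private/notes")

# Duplicate detection: two pages whose text differs only in spacing and case
is_duplicate = content_fingerprint("Getting   Started\n") == content_fingerprint(
    "getting started"
)
```

A real crawler would apply these checks to each discovered link before fetching it, and compare each fetched page's fingerprint against those already imported.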
