URL #

This covers how to load HTML documents from a list of URLs into a document format that we can use downstream.

 from langchain.document_loaders import UnstructuredURLLoader

urls = [
    "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-8-2023",
    "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-9-2023"
]

loader = UnstructuredURLLoader(urls=urls)

data = loader.load()

Selenium URL Loader #

This covers how to load HTML documents from a list of URLs using the SeleniumURLLoader .

Using selenium allows us to load pages that require JavaScript to render.

Setup #

To use the SeleniumURLLoader , you will need to install selenium and unstructured .

from langchain.document_loaders import SeleniumURLLoader

urls = [
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    "https://goo.gl/maps/NDSHwePEyaHMFGwh8"
]

loader = SeleniumURLLoader(urls=urls)

data = loader.load()

Playwright URL Loader #

This covers how to load HTML documents from a list of URLs using the PlaywrightURLLoader .

As in the Selenium case, Playwright allows us to load pages that need JavaScript to render.

Setup #

To use the PlaywrightURLLoader , you will need to install playwright and unstructured . Additionally, you will need to install the Playwright Chromium browser:

# Install playwright
!pip install "playwright"
!pip install "unstructured"
!playwright install

from langchain.document_loaders import PlaywrightURLLoader

urls = [
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    "https://goo.gl/maps/NDSHwePEyaHMFGwh8"
]

loader = PlaywrightURLLoader(urls=urls, remove_selectors=["header", "footer"])

data = loader.load()

Unstructured File Web Base