Why integrate LangChain with Apify Actors?
Actors are serverless cloud programs that can do almost anything a human can do in a web browser. By integrating them with LangChain, you can feed Actor results directly into LangChain’s vector indexes and build apps that query data crawled from websites, such as documentation or knowledge bases, as well as customer support chatbots and much more.
Let’s take, for example, Website Content Crawler. This Actor can deeply crawl websites and extract text content from the web pages. Its results can help you feed, fine-tune or train your LLMs or provide context for prompts for ChatGPT.
How to integrate Website Content Crawler with LangChain
Step 1. Install all dependencies
pip install apify-client langchain openai chromadb
Step 2. Import os, Document, VectorstoreIndexCreator, and ApifyWrapper into your source code:
import os
from langchain.document_loaders.base import Document
from langchain.indexes import VectorstoreIndexCreator
from langchain.utilities import ApifyWrapper
Step 3. Find your Apify API token and OpenAI API key and set them as environment variables:
os.environ["OPENAI_API_KEY"] = "Your OpenAI API key"
os.environ["APIFY_API_TOKEN"] = "Your Apify API token"
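A missing key usually only surfaces once the crawl or the embedding step is already running, so it can help to check both variables up front. The helper below is a minimal sketch, not part of LangChain or the Apify client; the name `missing_env_vars` is hypothetical, and only the two variable names come from Step 3.

```python
import os

# Variable names taken from Step 3.
REQUIRED_VARS = ("OPENAI_API_KEY", "APIFY_API_TOKEN")


def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required variables are set.")
```

Passing a plain dict as `env` makes the check easy to try without touching your real environment.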
Step 4. Run the Actor, wait for it to finish, and fetch its results from the Apify dataset into a LangChain document loader:
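# Create the wrapper; it reads APIFY_API_TOKEN from the environment.
apify = ApifyWrapper()

# Run the Actor and wrap its dataset in a document loader.
loader = apify.call_actor(
    actor_id="apify/website-content-crawler",
    run_input={"startUrls": [{"url": "https://python.langchain.com/en/latest/"}]},
    dataset_mapping_function=lambda item: Document(
        page_content=item["text"] or "", metadata={"source": item["url"]}
    ),
)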
Note: The Actor call can take some time because it crawls the LangChain documentation website before returning.
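The run_input in Step 4 is a dictionary with a startUrls list of {"url": ...} entries. If you want to point the crawler at your own pages, a small helper can build that structure from a plain list of URLs. This is a sketch only: `build_run_input` is a hypothetical name, and it assumes the Actor accepts more than one entry in startUrls.

```python
def build_run_input(urls):
    """Build a run_input dict in the shape used in Step 4."""
    return {"startUrls": [{"url": url} for url in urls]}


# The single URL from Step 4; add more entries to crawl more sites.
run_input = build_run_input(["https://python.langchain.com/en/latest/"])
print(run_input)
```

The resulting dictionary can be passed as the run_input argument in place of the literal one.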
If you already have some results in an Apify dataset, you can load them directly using ApifyDatasetLoader, as shown in this notebook. In that notebook, you'll also find the explanation of the dataset_mapping_function, which is used to map fields from the Apify dataset records to LangChain Document fields.
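To see what the mapping function from Step 4 does to individual records, you can run it on a couple of sample items. The sketch below uses a small stand-in Document class so that it runs without LangChain installed; the stand-in and the sample records are made up for illustration, and only the "text" and "url" field names come from the code above.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for LangChain's Document so this sketch runs on its own."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def map_item(item):
    # Same mapping as the lambda in Step 4: "text" becomes the page content
    # (or "" when it is None or empty) and "url" becomes the source.
    return Document(page_content=item["text"] or "", metadata={"source": item["url"]})


# Hypothetical dataset records, invented for this example.
sample_items = [
    {"url": "https://example.com/a", "text": "Hello from page A"},
    {"url": "https://example.com/b", "text": None},
]

docs = [map_item(item) for item in sample_items]
for doc in docs:
    print(doc.metadata["source"], repr(doc.page_content))
```

The `or ""` fallback matters for pages where no text was extracted: without it, a None value would end up as the page content.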
Step 5. Initialize the vector index from the crawled documents:
index = VectorstoreIndexCreator().from_loaders([loader])
Step 6. Query the vector index:
query = "What is LangChain?"
result = index.query_with_sources(query)
print(result["answer"])
print(result["sources"])
The query produces an output like this:
LangChain is a framework for developing applications powered by language models. It is designed to connect a language model to other sources of data and allow it to interact with its environment.
https://python.langchain.com/en/latest/
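If you want to show the sources from Step 6 as separate links, you may need to split them first. The sketch below assumes that the "sources" value is a single string that can contain several comma-separated URLs; that format is an assumption, so check what your LangChain version returns. The function name `split_sources` and the sample result are hypothetical.

```python
def split_sources(sources):
    """Split a comma-separated sources string into unique URLs, keeping order."""
    urls = []
    for part in sources.split(","):
        url = part.strip()
        if url and url not in urls:
            urls.append(url)
    return urls


# Hypothetical result dict, shaped like the one used in Step 6.
result = {
    "answer": "LangChain is a framework for developing applications powered by language models.",
    "sources": "https://python.langchain.com/en/latest/, https://python.langchain.com/en/latest/",
}

for url in split_sources(result["sources"]):
    print("Source:", url)
```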
If you want to see what other GPT and AI-enhanced tools you could integrate with LangChain, have a browse through Apify Store.
The video tutorial below shows you how to use LangChain with Apify Blog Scraper, which will help you understand how to integrate LangChain with any scraper you choose.