Web Scraper is an easy-to-use but powerful tool, ideal for getting started on Apify and for scraping websites both small and large. However, because all your code runs inside a web browser, it is built for crawling and cannot post-process the data or perform arbitrary jobs with it. When you need that, it is time to look into "raw" Apify Actors.
Actors can run arbitrary code and implement complicated workflows that are impossible inside Web Scraper's page function. You can send the data to a database, send it as an email attachment, run it through demanding computations, or simply modify it and save it again.
We will assume that you already have a working Web Scraper that outputs some data. If not, read this awesome tutorial and scrape some data. Now, let's move to the Actor. Start by creating a new Actor and naming it "postprocessing", then head to the API tab, where you can find the Run Actor API endpoint URL. Copy it to your clipboard...
... and return to our Web Scraper.
Click on the Webhooks tab and paste the copied URL there. Choose the "ACTOR.RUN.SUCCEEDED" event and press Save.
The Webhooks UI is constantly being improved, so what you see might look slightly different. Don't worry about the payload and the other settings.
Let's move back to our Actor. After the Web Scraper finishes, the webhook will call the run endpoint, and the Actor will start with the input it gets from the webhook. You can check the documentation to see what the full webhook input looks like.
We are interested in the "resource.defaultDatasetId" property, which is the ID of the dataset where the scraped data was saved. We can then load the data into our Actor using the Apify client. It is pretty simple.
// "input" is the webhook payload that the Actor received as its input
const response = await Apify.client.datasets.getItems({
    datasetId: input.resource.defaultDatasetId,
});
const data = response.items;
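The snippet above reads input.resource.defaultDatasetId directly, which fails with an unhelpful error if the Actor is started by hand without the webhook input. As a minimal sketch, the lookup could be wrapped in a small guard. The helper name getDatasetId is hypothetical (it is not part of the Apify SDK), and the sample input is abbreviated with made-up values; only the resource.defaultDatasetId property comes from the text above.

```javascript
// Hypothetical helper (not part of the Apify SDK): returns the dataset ID
// from the webhook input, or throws a descriptive error if it is missing.
function getDatasetId(input) {
    const datasetId = input && input.resource && input.resource.defaultDatasetId;
    if (typeof datasetId !== 'string' || datasetId.length === 0) {
        throw new Error('Input has no resource.defaultDatasetId; was the Actor started by the webhook?');
    }
    return datasetId;
}

// Abbreviated sample input with made-up values, for illustration only.
const sampleInput = {
    eventType: 'ACTOR.RUN.SUCCEEDED',
    resource: { defaultDatasetId: 'exampleDatasetId123' },
};

console.log(getDatasetId(sampleInput));
```

The returned value can then be passed as the datasetId when loading the items.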
Now that we have the data loaded, we can do arbitrary computations or manipulations with it and save it again to a different dataset. Let's imagine we have a list of people and we want to save only the women.
const processedData = data.filter((item) => item.gender === 'woman');
await Apify.pushData(processedData);
The processed data is now saved in a new dataset, the default dataset of the Actor run, which we can access from the Actor interface itself.
The whole Actor would look like this:
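Filtering is only one kind of manipulation. As a rough sketch of the "modify and save again" case, the processing step could live in a plain function that both filters the items and adds a derived field. The function name processItems, the firstName and lastName fields, and the sample records are all assumptions made for illustration; your scraped items will have whatever shape your page function returns.

```javascript
// Hypothetical post-processing step: keep only the women and add a derived
// "fullName" field. The field names are assumptions about the scraped data.
function processItems(items) {
    return items
        .filter((item) => item.gender === 'woman')
        // Spread into a new object so the original items stay untouched.
        .map((item) => ({ ...item, fullName: `${item.firstName} ${item.lastName}` }));
}

// Made-up sample records, for illustration only.
const sample = [
    { firstName: 'Ada', lastName: 'Lovelace', gender: 'woman' },
    { firstName: 'Alan', lastName: 'Turing', gender: 'man' },
];

console.log(processItems(sample));
```

Keeping the step in a plain function means it can be tried on a few sample records before it is wired into the Actor.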
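const Apify = require('apify');

Apify.main(async () => {
    // The webhook payload arrives as the Actor's input
    const input = await Apify.getValue('INPUT');

    // Load the items scraped by the Web Scraper run
    const response = await Apify.client.datasets.getItems({
        datasetId: input.resource.defaultDatasetId,
    });
    const data = response.items;

    // Keep only the women and save them to this run's default dataset
    const processedData = data.filter((item) => item.gender === 'woman');
    await Apify.pushData(processedData);
});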
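If you want to try the workflow before connecting it to a real Web Scraper run, one possible approach is to pass the two platform calls in as functions and replace them with in-memory stubs. This is only a sketch under stated assumptions: postprocess, fetchItems and saveItems are hypothetical names, and the stub data is made up. In the Actor above, fetchItems would wrap Apify.client.datasets.getItems and saveItems would wrap Apify.pushData.

```javascript
// Core workflow with the two platform calls passed in as functions.
// "fetchItems" and "saveItems" are hypothetical names, not Apify SDK calls.
async function postprocess(input, fetchItems, saveItems) {
    const items = await fetchItems(input.resource.defaultDatasetId);
    const processed = items.filter((item) => item.gender === 'woman');
    await saveItems(processed);
    return processed.length;
}

// Local dry run with in-memory stubs and made-up data.
const saved = [];
const fakeInput = { resource: { defaultDatasetId: 'exampleDatasetId123' } };
const fakeFetch = async () => [
    { name: 'Ada', gender: 'woman' },
    { name: 'Alan', gender: 'man' },
];
const fakeSave = async (items) => { saved.push(...items); };

postprocess(fakeInput, fakeFetch, fakeSave)
    .then((count) => console.log(`Saved ${count} item(s)`));
```

The function returns the number of saved items, which is convenient for logging at the end of a run.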