Web scraping is a technique where you load a web page and 'scrape' the data off it to be used elsewhere. It's not pretty, but sometimes scraping is the only way to access data or content from a site. This post walks through doing it with NodeJS and its ecosystem of libraries. To follow along, it helps to have:
- ✅ Experience using DevTools to extract selectors of elements
⭐ Make sure to check out the resources at the end of this article to learn more!
After reading this post, you will be able to:
- Have a functional understanding of NodeJS
- Use multiple HTTP clients to assist in the web scraping process
- Use multiple modern and battle-tested libraries to scrape the web
Understanding NodeJS: A brief introduction
As opposed to how most languages, including C and C++, deal with concurrency, which is by employing multiple threads, NodeJS makes use of a single main thread and performs tasks in a non-blocking manner with the help of the Event Loop.
Spinning up a simple web server is fairly straightforward, as shown below:
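A minimal sketch using Node's built-in `http` module (the port and response text are just illustrative):

```js
const http = require('http');

const port = 3000;

// Create a server that answers every request with plain text
const server = http.createServer((req, res) => {
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  res.end('Hello World');
});

server.listen(port, () => {
  console.log(`Server running at http://localhost:${port}/`);
});
```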
If you have NodeJS installed, you can run the above code by typing `node <YourFileNameHere>.js` (without the < and >) in a terminal. Then, open up your browser and navigate to localhost:3000, and you will see some text saying “Hello World”. NodeJS is ideal for applications that are I/O intensive.
HTTP clients: querying the web
HTTP clients are tools capable of sending a request to a server and then receiving a response from it. Almost every tool that will be discussed in this article uses an HTTP client under the hood to query the server of the website that you will attempt to scrape.
It is fairly simple to make an HTTP request with Request:
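A rough sketch of a callback-based request with the Request library; the URL is only an example target:

```js
const request = require('request');

// Fetch the JSON listing of a subreddit and log the response body
request('https://www.reddit.com/r/programming.json', (error, response, body) => {
  if (error) {
    return console.error(error);
  }
  console.log(body);
});
```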
You can find the Request library on GitHub, and installing it is as simple as running `npm install request`.
You can also find the deprecation notice and what it means here. If the fact that this library is deprecated puts you off, there are other options down below!
Axios is a promise-based HTTP client that runs both in the browser and NodeJS. If you use TypeScript, then Axios has you covered with built-in types.
Making an HTTP request with Axios is straightforward. It ships with promise support by default, as opposed to the callbacks used in Request:
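A minimal promise-based sketch with Axios (the URL is again illustrative):

```js
const axios = require('axios');

// Fetch the JSON listing of a subreddit and log the response body
axios
  .get('https://www.reddit.com/r/programming.json')
  .then((response) => {
    console.log(response.data);
  })
  .catch((error) => {
    console.error(error);
  });
```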
If you fancy the async/await syntax sugar for the promise API, you can do that too. But since top level await is still at stage 3, we will have to make use of an async function instead:
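A sketch using an async function named `getForum`, as referenced just below (URL still illustrative):

```js
const axios = require('axios');

// Wrap the request in an async function, since top-level await isn't used here
async function getForum() {
  try {
    const response = await axios.get('https://www.reddit.com/r/programming.json');
    console.log(response.data);
  } catch (error) {
    console.error(error);
  }
}
```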
All you have to do is call `getForum()`! You can find the Axios library on GitHub, and installing Axios is as simple as running `npm install axios`.
Much like Axios, SuperAgent is another robust HTTP client that has support for promises and the async/await syntax sugar. It has a fairly straightforward API like Axios, but SuperAgent has more dependencies and is less popular.
Regardless, making an HTTP request with Superagent using promises, async/await, or callbacks looks like this:
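A sketch of all three styles with SuperAgent, against an illustrative URL:

```js
const superagent = require('superagent');

const forumURL = 'https://www.reddit.com/r/programming.json'; // example target

// 1. Callbacks
superagent.get(forumURL).end((error, response) => {
  if (error) {
    return console.error(error);
  }
  console.log(response.body);
});

// 2. Promises
superagent
  .get(forumURL)
  .then((response) => console.log(response.body))
  .catch((error) => console.error(error));

// 3. Promises with async/await
async function getForum() {
  try {
    const response = await superagent.get(forumURL);
    console.log(response.body);
  } catch (error) {
    console.error(error);
  }
}
```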
You can find the SuperAgent library on GitHub, and installing SuperAgent is as simple as running `npm install superagent`.
For the upcoming few web scraping tools, Axios will be used as the HTTP client.
Note that there are other great HTTP clients for web scraping, like node-fetch!
Regular expressions: the hard way
The simplest way to get started with web scraping without any dependencies is to use a bunch of regular expressions on the HTML string that you fetch using an HTTP client. But there is a big tradeoff: regular expressions aren't nearly as flexible as a proper HTML parser, and both professionals and amateurs struggle with writing them correctly.
For complex web scraping, the regular expression can also get out of hand. With that said, let's give it a go. Say there's a label with some username in it, and we want the username. This is similar to what you'd have to do if you relied on regular expressions:
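A rough sketch of that approach, with a made-up HTML string and username:

```js
// Hypothetical HTML containing the label we care about
const htmlString = '<label>Username: John Doe</label>';

// Capture whatever sits between the opening and closing <label> tags
const result = htmlString.match(/<label>(.+)<\/label>/);

console.log(result[1]);                // "Username: John Doe"
console.log(result[1].split(': ')[1]); // "John Doe"
```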
`match()` usually returns an array with everything that matches the regular expression. The second element (index 1) contains the `textContent` or `innerHTML` of the `<label>` tag, which is what we want. But this result contains some unwanted text (“Username: ”), which has to be removed.
As you can see, for a very simple use case there are unnecessarily many steps and a surprising amount of work involved. This is why you should rely on something like an HTML parser, which we will talk about next.
Cheerio: Core jQuery for traversing the DOM
Cheerio is an efficient and light library that allows you to use the rich and powerful API of jQuery on the server-side. If you have used jQuery previously, you will feel right at home with Cheerio. It removes all of the DOM inconsistencies and browser-related features and exposes an efficient API to parse and manipulate the DOM.
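A minimal sketch of loading and manipulating a snippet of markup with Cheerio (the HTML string is made up):

```js
const cheerio = require('cheerio');

// Load a small piece of markup into Cheerio
const $ = cheerio.load('<h2 class="title">Hello world</h2>');

// Manipulate it with the familiar jQuery-style API
$('h2.title').text('Hello there!');
$('h2').addClass('welcome');

console.log($.html());
// The <h2> now has the classes "title" and "welcome" and the text "Hello there!"
```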
As you can see, using Cheerio is similar to how you'd use jQuery.
However, it does not work the same way that a web browser works, which means it does not:
- Render any of the parsed or manipulated DOM elements
- Apply CSS or load any external resource
To demonstrate the power of Cheerio, we will attempt to crawl the r/programming forum on Reddit and get a list of post titles.
First, install Cheerio and Axios by running the following command: `npm install cheerio axios`.
Then create a new file called `crawler.js`, and copy/paste the following code:
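A sketch of such a crawler, built around the `$('div > p.title > a')` selector discussed below; the old Reddit URL may need adjusting:

```js
const axios = require('axios');
const cheerio = require('cheerio');

async function getPostTitles() {
  try {
    // Fetch the HTML of old Reddit's r/programming listing
    const { data } = await axios.get('https://old.reddit.com/r/programming/');

    // Feed the HTML into Cheerio
    const $ = cheerio.load(data);
    const postTitles = [];

    // Each post title is an anchor inside a p.title element
    $('div > p.title > a').each((_idx, el) => {
      postTitles.push($(el).text());
    });

    return postTitles;
  } catch (error) {
    throw error;
  }
}

getPostTitles().then((postTitles) => console.log(postTitles));
```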
`getPostTitles()` is an asynchronous function that will crawl Reddit's old r/programming forum. First, the HTML of the website is obtained using a simple HTTP GET request with the Axios HTTP client library. Then the HTML data is fed into Cheerio using the `cheerio.load()` function.
With the help of the browser DevTools, you can obtain the selector that is capable of targeting all of the post cards. If you've used jQuery, the selector `$('div > p.title > a')` is probably familiar. It will get all the posts; since you only want the title of each post individually, you have to loop through each of them, which is done with the help of the `each()` function.
To extract the text out of each title, you must fetch the DOM element with the help of Cheerio (`el` refers to the current element). Then, calling `text()` on each element will give you the text.
Now, you can pop open a terminal and run `node crawler.js`. You'll then see an array of about 25 or 26 post titles (it'll be quite long). While this is a simple use case, it demonstrates the simple nature of the API provided by Cheerio.
JSDOM: the DOM for Node
JSDOM is a JavaScript implementation of the Document Object Model (DOM) that can be used in NodeJS. Once a DOM is created, it is possible to interact with the web application or website you want to crawl programmatically, so something like clicking on a button is possible. If you are familiar with manipulating the DOM, using JSDOM will be straightforward.
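A minimal sketch of creating and manipulating a DOM with JSDOM (the markup is made up):

```js
const { JSDOM } = require('jsdom');

// Build a DOM from a small, illustrative HTML string
const { document } = new JSDOM('<h2 class="title">Hello world</h2>').window;

// Manipulate it with the usual browser DOM methods and properties
const heading = document.querySelector('.title');
heading.textContent = 'Hello there!';
heading.classList.add('welcome');

console.log(heading.outerHTML);
// => <h2 class="title welcome">Hello there!</h2>
```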
As you can see, JSDOM creates a DOM. Then you can manipulate this DOM with the same methods and properties you would use while manipulating the browser DOM.
To demonstrate how you could use JSDOM to interact with a website, we will get the first post of the Reddit r/programming forum and upvote it. Then, we will verify if the post has been upvoted.
Start by running the following command to install JSDOM and Axios:
npm install jsdom axios
Then, make a file named `crawler.js` and copy/paste the following code:
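A sketch of such a crawler; the upvote-arrow selector is a guess at old Reddit's markup and may need adjusting:

```js
const { JSDOM } = require('jsdom');
const axios = require('axios');

async function upvoteFirstPost() {
  try {
    // Fetch the HTML of old Reddit's r/programming listing
    const { data } = await axios.get('https://old.reddit.com/r/programming/');

    // Create a DOM from the fetched HTML (the two options are explained below)
    const dom = new JSDOM(data, {
      runScripts: 'dangerously',
      resources: 'usable',
    });
    const { document } = dom.window;

    // Grab the first post's upvote arrow and click it
    const firstPost = document.querySelector('div > div.midcol > div.arrow');
    firstPost.click();

    // Verify the click by checking for the "upmod" class
    const isUpvoted = firstPost.classList.contains('upmod');
    return isUpvoted
      ? 'Post has been upvoted successfully!'
      : 'The post has not been upvoted!';
  } catch (error) {
    throw error;
  }
}

upvoteFirstPost().then((msg) => console.log(msg));
```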
upvoteFirstPost() is an asynchronous function that will obtain the first post in r/programming and upvote it. To do this, axios sends an HTTP GET request to fetch the HTML of the URL specified. Then a new DOM is created by feeding the HTML that was fetched earlier.
The JSDOM constructor accepts the HTML as the first argument and the options as the second. The two options that have been added perform the following functions:
- runScripts: When set to “dangerously”, it allows the execution of event handlers and any JavaScript code inside the page. If you are not sure about the credibility of the scripts your application will run, it is best to set runScripts to “outside-only”, which attaches the JavaScript-specification globals to the `window` object, thus preventing any script from being executed on the inside.
- resources: When set to “usable”, it allows the loading of any external script declared using the `<script>` tag (e.g., the jQuery library fetched from a CDN).
Once the DOM has been created, you can use the same DOM methods to get the first post's upvote button and then click on it. To verify if it has been clicked, you could check the `classList` for a class called `upmod`. If this class exists in `classList`, a message is returned.
Now, you can pop open a terminal and run `node crawler.js`. You'll then see a neat string that will tell you whether the post has been upvoted. While this example use case is trivial, you could build on top of it to create something powerful (for example, a bot that goes around upvoting a particular user's posts).
If you find JSDOM's expressiveness lacking, if your crawling relies heavily on such manipulations, or if you need to recreate many different DOMs, the following options will be a better match.
Puppeteer: the headless browser
Puppeteer, as the name implies, allows you to manipulate the browser programmatically, just like a puppet would be manipulated by its puppeteer. It achieves this by providing a developer with a high-level API to control a version of Chrome that is headless by default but can be configured to run non-headless.
Taken from the Puppeteer Docs (Source)
Puppeteer is particularly more useful than the aforementioned tools because it allows you to crawl the web as if a real person were interacting with a browser. This opens up a few possibilities that weren't there before:
- You can get screenshots or generate PDFs of pages.
- You can crawl a Single Page Application and generate pre-rendered content.
- You can automate many different user interactions, like keyboard inputs, form submissions, navigation, etc.
It could also play a big role in many other tasks outside the scope of web crawling, such as UI testing and assisting with performance optimization.
Quite often, you will probably want to take screenshots of websites or get to know a competitor's product catalog. Puppeteer can be used to do this. To start, install Puppeteer by running the following command:
npm install puppeteer
This will download a bundled version of Chromium which takes up about 180 to 300 MB, depending on your operating system. If you wish to disable this and point Puppeteer to an already downloaded version of Chromium, you must set a few environment variables.
This, however, is not recommended. If you truly wish to avoid downloading Chromium and Puppeteer for this tutorial, you can rely on the Puppeteer playground.
Let's attempt to get a screenshot and PDF of the r/programming forum on Reddit. Create a new file called `crawler.js`, and copy/paste the following code:
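A sketch of such a script; the screenshot filename is an arbitrary choice here:

```js
const puppeteer = require('puppeteer');

const URL = 'https://www.reddit.com/r/programming/';

async function getVisual() {
  try {
    // Launch a (headless by default) browser and open a new page, i.e. a tab
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    // Direct the page to the URL and wait for it to load
    await page.goto(URL);

    // Take a screenshot and a PDF of the page
    await page.screenshot({ path: 'screenshot.png' });
    await page.pdf({ path: 'page.pdf' });

    // Destroy the browser instance along with the page
    await browser.close();
  } catch (error) {
    console.error(error);
  }
}

getVisual();
```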
`getVisual()` is an asynchronous function that will take a screenshot and PDF of the page at the address assigned to the `URL` variable. To start, an instance of the browser is created by running `puppeteer.launch()`. Then, a new page is created; this page can be thought of as a tab in a regular browser. Then, by calling `page.goto()` with the `URL` as the parameter, the page that was created earlier is directed to the URL specified. Once the page has finished loading, a screenshot and PDF are taken using `page.screenshot()` and `page.pdf()` respectively. Finally, the browser instance is destroyed along with the page.
When you run the code by typing `node crawler.js` in the terminal, after a few seconds you will notice that two files, the screenshot and `page.pdf`, have been created.
Also, we've written a complete guide on how to download a file with Puppeteer. You should check it out!
Nightmare: an alternative to Puppeteer
Nightmare is another high-level browser automation library like Puppeteer. It uses Electron, but is said to be roughly twice as fast as its predecessor PhantomJS, and it's more modern.
If you dislike Puppeteer or feel discouraged by the size of the Chromium bundle, Nightmare is an ideal choice. To start, install the Nightmare library by running the following command:
npm install nightmare
Once Nightmare has been downloaded, we will use it to find ScrapingBee's website through a Google search. To do so, create a file called `crawler.js` and copy/paste the following code into it:
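A sketch of such a script; the search-box, button, and result selectors are guesses at Google's markup and may need adjusting:

```js
const Nightmare = require('nightmare');
const nightmare = Nightmare();

nightmare
  .goto('https://www.google.com/')
  // Type the query into the search box
  .type('input[title="Search"]', 'ScrapingBee')
  // Submit the form by clicking the "Google Search" button
  .click('input[value="Google Search"]')
  // Wait until the first result link has appeared
  .wait('#rso a')
  // Read the href attribute of the first result's anchor tag
  .evaluate(() => document.querySelector('#rso a').href)
  .end()
  .then((link) => {
    console.log('ScrapingBee Web Link:', link);
  })
  .catch((error) => {
    console.error('Search failed:', error);
  });
```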
First, a Nightmare instance is created. Then, this instance is directed to the Google search engine by calling `goto()`. Once it has loaded, the search box is fetched using its selector, and the value of the search box (an input tag) is changed to “ScrapingBee”.
After this is finished, the search form is submitted by clicking on the “Google Search” button. Then, Nightmare is told to wait until the first link has loaded. Once it has loaded, a DOM method will be used to fetch the value of the `href` attribute of the anchor tag that contains the link.
Finally, once everything is complete, the link is printed to the console. To run the code, type `node crawler.js` in your terminal.
That was a long read! But now you understand the different ways to use NodeJS and its rich ecosystem of libraries to crawl the web in any way you want. To wrap up, you learned:
- ✅ HTTP clients such as Axios, SuperAgent, node-fetch, and Request are used to send HTTP requests to a server and receive a response.
- ✅ Cheerio and JSDOM let you parse and manipulate the HTML you fetch, using a jQuery-like API and the standard browser DOM API respectively, without running a full browser.
- ✅ Puppeteer and Nightmare are high-level browser automation libraries that allow you to programmatically manipulate web applications as if a real person were interacting with them.
While this article tackles the main aspects of web scraping with NodeJS, it does not talk about web scraping without getting blocked.
If you want to learn how to avoid getting blocked, read our complete guide, and if you don't want to deal with this, you can always use our web scraping API.
Would you like to read more? Check these links out:
Web Scraping Using Php Curl
- NodeJS Website - Contains documentation and a lot of information on how to get started.
- Puppeteer's Docs - Contains the API reference and guides for getting started.
- Playwright - An alternative to Puppeteer, backed by Microsoft.
- ScrapingBee's Blog - Contains a lot of information about Web Scraping goodies on multiple platforms.