PHP curl link-following crawling (navigating and scraping multiple pages)

2020. 7. 17. 14:20 · PHP


Source: Navigating And Scraping Multiple Pages With PHP & CURL [Part 3] – Jacob Ward
http://www.jacobward.co.uk/navigating-and-scraping-multiple-pages-with-php-curl-part-3/

 

 

If we take our scraper script so far, we can perform a basic search on IMDb and scrape the single page of results that is returned for the movies’ URLs.

But what if we want to scrape all of the results pages? What if we then want to scrape all of the results for their specific attributes, such as movie name, release date, description, director and so on…?

Well, that’s what we’ll be covering today: using PHP and cURL to navigate the results pages, scrape multiple pages of the website for data, and organise that data into a logical structure for further use.

So, our first task is to get the URLs from all of the results pages. This involves evaluating whether there is another page of results and, if there is, visiting it, scraping the results URLs and adding them to our array.

If we take our script from last time and include our scrape_between() and curl() functions, we need to make the following changes to the script. Don’t worry, I’ll talk them through after.
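In case you don’t have the earlier parts to hand, here’s a minimal sketch of what those two helpers might look like. The exact cURL options below (returning the page as a string, following redirects, a generic user agent) are assumptions for illustration, not necessarily the settings used in Part 1.

    function curl($url) {
        $ch = curl_init($url);  // Initialising a cURL session for $url
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Returning the page as a string instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE); // Following any redirects
        curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0");    // Assumed generic user agent string
        $data = curl_exec($ch); // Executing the request and storing the response
        curl_close($ch);    // Closing the cURL session
        return $data;   // Returning the downloaded page
    }

    function scrape_between($data, $start, $end) {
        $data = stristr($data, $start); // Stripping everything before $start
        $data = substr($data, strlen($start));  // Stripping $start itself
        $stop = stripos($data, $end);   // Finding the position of $end
        $data = substr($data, 0, $stop);    // Stripping everything from $end onwards
        return $data;   // Returning only the data between the two markers
    }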


<?php

    $continue = TRUE;   // Assigning a boolean value of TRUE to the $continue variable

    $url = "http://www.imdb.com/search/title?genres=action";    // Assigning the URL we want to scrape to the variable $url

    $results_urls = array();    // Initialising the array that will hold the scraped result URLs

    // While $continue is TRUE, i.e. there are more search results pages
    while ($continue == TRUE) {

        $results_page = curl($url); // Downloading the results page using our curl() function

        $results_page = scrape_between($results_page, "<div id=\"main\">", "<div id=\"sidebar\">"); // Scraping out only the middle section of the results page that contains our results

        $separate_results = explode("<td class=\"image\">", $results_page);    // Exploding the results page into an array of separate results

        // For each separate result, scrape the URL
        foreach ($separate_results as $separate_result) {
            if ($separate_result != "") {
                $results_urls[] = "http://www.imdb.com" . scrape_between($separate_result, "href=\"", "\" title="); // Scraping the page ID number, appending it to the IMDb URL and adding that URL to our array
            }
        }

        // Searching for a 'Next' link. If it exists, scrape its URL and set it as $url for the next loop of the scraper
        if (strpos($results_page, "Next&nbsp;&raquo;")) {
            $continue = TRUE;
            $url = scrape_between($results_page, "<span class=\"pagination\">", "</span>");
            if (strpos($url, "Prev</a>")) {
                $url = scrape_between($url, "Prev</a>", ">Next");
            }
            $url = "http://www.imdb.com" . scrape_between($url, "href=\"", "\"");
        } else {
            $continue = FALSE;  // Setting $continue to FALSE if there's no 'Next' link
        }

        sleep(rand(3, 5));  // Sleep for 3 to 5 seconds. Useful if not using proxies. We don't want to get into trouble.
    }

?>

First up, we retrieve the initial results page. Then we scrape all of the results and add them to the array $results_urls. Next, we check whether there is a “Next” link to another page of results; if there is, we scrape its URL and loop through the script to repeat the scraping of results from the next page. The loop iterates, visiting each next page and scraping its results, until there are no more pages of results.

Now we have an array with all of the results URLs, which we can loop over with foreach() to visit each URL and scrape the results. I’ll leave that to you; with what we’ve covered so far it should be easy to figure out.

I’ll get you started:


    foreach ($results_urls as $result_url) {
        // Visit $result_url (Reference Part 1)
        // Scrape data from page (Reference Part 1)
        // Add to array or other suitable data structure (Reference Part 2)
    }
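To make that concrete, here’s one possible sketch of the loop. The "<title>" markers are placeholders I’ve picked for illustration, not verified IMDb markup, and $movies is just one choice of data structure:

    $movies = array();  // The structure that will hold our scraped data

    foreach ($results_urls as $result_url) {
        $movie_page = curl($result_url);    // Visiting the movie page with our curl() function (Part 1)
        $movie_title = scrape_between($movie_page, "<title>", "</title>"); // Scraping a sample attribute (assumed marker: the page's <title> tag)
        $movies[] = array("url" => $result_url, "title" => trim($movie_title));    // Adding the result to our array (Part 2)
        sleep(rand(3, 5));  // Pausing between requests, as in the main script
    }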

In the next post in the series I’ll post up the code you should have got and then we’ll cover downloading images and other files.

Up next time: Downloading Images And Files With PHP & CURL

All Posts From This Series

Posted on May 25, 2012 by Jacob Ward. This entry was posted in PHP, Programming, Resources, Tutorials, Web Scraping, Web Scraping With PHP & CURL.
