Because it is essential to proceed step by step, I first wrote a post on web scraping. That post introduced the concept of web scraping applied to real estate data, explained its value, and walked through a simple end-to-end example: visiting a website, extracting some valuable information, and storing it in a DataFrame. There, I only scraped data from a single page. You can get more information about it here.
A data frame with only a small amount of data is rarely enough for a meaningful analysis. So, in this post, I will widen the field of vision and scrape data from multiple pages.
Web scraping multiple pages can be challenging. I found myself in this tough situation, and the purpose of this post is to deal with that issue.
What is Pagination?
As we all know, websites are full of data. But web developers cannot display all of that data on one page, which is why the information is spread across multiple pages.
There are multiple types of pagination. In this post, I will focus on the one used by the real estate website seloger.com, which I'm working with.
That one is called pagination with a Next button. Here is an example. Note that the website is in French.
The URL changes each time we click through to a new page. And it is crucial to be aware that this website doesn't have only nine pages.
Building page extractor
The libraries used:
I use three libraries for this project:
- BeautifulSoup: This library makes it easy to scrape information from web pages and analyse HTML documents.
- Pandas: This library helps with data manipulation. We will store the extracted data in a structured format.
- Requests: This library sends HTTP requests and retrieves the pages' HTML.
Get the pages
As mentioned, the data are available on seloger.com, and we need the URL of the results page. I recommend using Chrome and its developer tools while doing web scraping.
url = "https://www.seloger.com/list.htm?projects=2,5&types=2,1&natures=1,2,4&places=[{%22inseeCodes%22:[60088]}]&rooms=2&mandatorycommodities=0&enterprise=0&qsVersion=1.0&LISTING-LISTpg=1"
When we look at this URL and start playing with the value of "LISTING-LISTpg", we find that the number after the equals sign is the page number.
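Since the page number is simply appended after "LISTING-LISTpg=", we can build the URL for any page with a small helper (the name page_url is mine, introduced here for illustration):

```python
# Query string taken from the search URL above; only the trailing
# LISTING-LISTpg value changes from page to page.
BASE_URL = (
    "https://www.seloger.com/list.htm"
    "?projects=2,5&types=2,1&natures=1,2,4"
    "&places=[{%22inseeCodes%22:[60088]}]"
    "&rooms=2&mandatorycommodities=0&enterprise=0"
    "&qsVersion=1.0&LISTING-LISTpg="
)

def page_url(page: int) -> str:
    """Return the listing URL for a given page number."""
    return BASE_URL + str(page)

print(page_url(3))
```

This makes the two options below easy to express: Option 1 calls page_url with a counter from a for loop, Option 2 with a counter incremented inside a while loop.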
Option 1 :
The first, simple solution is to find the last page of the website and loop over all the pages to collect every listing.
import pandas as pd
import requests
from bs4 import BeautifulSoup

list_all_ads = []
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/39.0.2171.95 Safari/537.36'}

for page in range(1, 70):
    url = "https://www.seloger.com/list.htm?projects=2,5&types=2,1&natures=1,2,4&places=[{%22inseeCodes%22:[" \
          "60088]}]&rooms=2&mandatorycommodities=0&enterprise=0&qsVersion=1.0&LISTING-LISTpg="
    response = requests.get(url + str(page), headers=headers)
    soup = BeautifulSoup(response.text, "lxml")
    for element in soup.find_all('div', attrs={'class': 'ListContent-sc-1viyr2k-0 klbvnS '
                                                        'classified__ClassifiedContainer-sc-1wmlctl-0 dlUdTD '
                                                        'Card__CardContainer-sc-7insep-5 kLpWdA'}):
        price = element.find('div', attrs={'data-test': 'sl.price-label'}).text
        agency_name_value = None  # reset for every listing so a previous ad's agency is not reused
        for agency in element.find_all('div', {'class': 'Contact__ContentContainer-sc-3d01ca-2 cKwmCO'}):
            agency_name_value = agency.a.text
        type = element.find('div', attrs={'data-test': 'sl.title'}).text
        address = element.find('div', attrs={'data-test': 'sl.address'}).text
        ul_tagsLine_0 = element.find('ul', attrs={'data-test': 'sl.tagsLine_0'})
        list_ul_tagsLine_0 = [li.text for li in ul_tagsLine_0.find_all("li")]
        # getLiValue is a helper that parses the tags line into its three fields
        numbers_of_pieces, rooms, size = getLiValue(list_ul_tagsLine_0)
        list_all_ads.append({"price": price, "agency_name": agency_name_value, "type": type, "address": address,
                             "numbers_of_pieces": numbers_of_pieces, "rooms": rooms, "size": size})

df_seloger = pd.DataFrame(list_all_ads)
df_seloger.to_csv('listings.csv', index=False, encoding='utf-8')
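The getLiValue helper used above parses the tags line of a listing into the number of pieces, the number of rooms, and the size. Its implementation is not shown in this post; here is a minimal sketch that assumes the tags look like "3 pièces", "2 chambres", and "45 m²" — an assumption about the site's format, not a verified one:

```python
def getLiValue(tags):
    """Extract pieces, rooms and size from tag strings such as
    ['3 pièces', '2 chambres', '45 m²'] (format assumed, not verified)."""
    pieces = rooms = size = None
    for tag in tags:
        if "pièce" in tag:      # e.g. '3 pièces' -> '3'
            pieces = tag.split()[0]
        elif "chambre" in tag:  # e.g. '2 chambres' -> '2'
            rooms = tag.split()[0]
        elif "m²" in tag:       # e.g. '45 m²' -> '45'
            size = tag.split()[0]
    return pieces, rooms, size
```

Any field that is missing from a listing simply stays None, which pandas will later render as an empty cell.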
Option 2 :
The second method is to inspect the Next button and keep extracting data until the button disappears. When the Next button is no longer present on the page, we know we have reached the last page.
First, inspect the HTML tags. To do this, right-click on the button and click on Inspect. We get this:
We can see that "Suivant" ("Next" in French) sits in a span tag inside a global div tag.
suivant = soup.find('div', attrs={'data-testid': 'gsl.uilib.Paging'}).find('span', attrs={'class': 'sc-bxivhb bUZGRr'})
The code above locates the "Suivant" element; when find returns None, the button is gone and we are on the last page.
import time  # used to pause between requests

list_all_ads = []
nextPage = True
page = 1
while nextPage:
    url = "https://www.seloger.com/list.htm?projects=2,5&types=2,1&natures=1,2,4&places=[{%22inseeCodes%22:[" \
          "60088]}]&rooms=2&mandatorycommodities=0&enterprise=0&qsVersion=1.0&LISTING-LISTpg="
    response = requests.get(url + str(page), headers=headers)
    soup = BeautifulSoup(response.text, "lxml")
    for element in soup.find_all('div', attrs={'class': 'ListContent-sc-1viyr2k-0 klbvnS '
                                                        'classified__ClassifiedContainer-sc-1wmlctl-0 dlUdTD '
                                                        'Card__CardContainer-sc-7insep-5 kLpWdA'}):
        price = element.find('div', attrs={'data-test': 'sl.price-label'}).text
        agency_name_value = None  # reset for every listing
        for agency in element.find_all('div', {'class': 'Contact__ContentContainer-sc-3d01ca-2 cKwmCO'}):
            agency_name_value = agency.a.text
        type = element.find('div', attrs={'data-test': 'sl.title'}).text
        address = element.find('div', attrs={'data-test': 'sl.address'}).text
        ul_tagsLine_0 = element.find('ul', attrs={'data-test': 'sl.tagsLine_0'})
        list_ul_tagsLine_0 = [li.text for li in ul_tagsLine_0.find_all("li")]
        numbers_of_pieces, rooms, size = getLiValue(list_ul_tagsLine_0)
        list_all_ads.append({"price": price, "agency_name": agency_name_value, "type": type, "address": address,
                             "numbers_of_pieces": numbers_of_pieces, "rooms": rooms, "size": size})
    # stop once the "Suivant" button is no longer on the page
    suivant = soup.find('div', attrs={'data-testid': 'gsl.uilib.Paging'}).find('span',
                                                                              attrs={'class': 'sc-bxivhb bUZGRr'})
    if suivant is None:
        nextPage = False
    page += 1
    time.sleep(3)  # pause between pages to avoid hammering the server

df_seloger = pd.DataFrame(list_all_ads)
df_seloger.to_csv('listings.csv', index=False, encoding='utf-8')
Well done!
DISCLAIMER: You may still get blocked if the server detects that you are trying to scrape large amounts of data with your script.
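One common way to reduce that risk is to avoid a perfectly regular request rhythm: instead of a constant time.sleep(3), pause for a randomized delay between pages. The helper below is a sketch of mine — the name polite_pause and the delay values are arbitrary choices, and randomized delays reduce but do not eliminate the chance of being blocked:

```python
import random
import time

def polite_pause(base=3.0, jitter=2.0):
    """Sleep for base seconds plus a random extra up to jitter seconds,
    so requests do not arrive at a fixed, bot-like rhythm."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay
```

In the loops above, calling polite_pause() in place of time.sleep(3) keeps the average pace similar while making the timing less predictable.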
Conclusion
This article explained what pagination is and how to scrape data from multiple pages.
I hope this information was helpful and exciting. If you have any questions or want to say hi, I'm happy to connect and respond to your questions about my blogs! Feel free to visit my website for more!