Web scraping the Allociné DVD release pages

During my entrepreneurial adventure (with the cultural agenda), I regularly had to add new events, a tedious task that could have been partly automated. So I had requested a quote for web scraping the events published on cultural agendas.

Knowing that I was comfortable with HTML and CSS and had some basics in Python, one of the developers told me that I should be able to "easily" scrape most agendas. But I have to confess: it was too new for me, my time was running out, and I had other tasks to do, so I gave up on the idea (maybe too quickly).

A year has passed since then, and my progress in Python has changed the game. What seemed out of reach at the time is now enjoyable. A lovely feeling!

Web scraping movies on Allociné

I regularly check the latest DVD releases on Allociné, so I wanted to list all these films with: title, press rating, spectator rating, actors, summary, and URL.

This was my first web scraping exercise, and yes, it was really fun to do!

0. Import the libraries

# Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

1. Collect data from the URL

# create the list movies_cards, with one nested list of movie cards per page
pages = 25
url_param = 'http://www.allocine.fr'
fill_val = ''
movies_cards = []

for page in range(pages):
    url = f'http://www.allocine.fr/dvd/meilleurs/?page={page}'
    r = requests.get(url)
    allocine = BeautifulSoup(r.text, 'lxml')
    cards = allocine.find_all('div', class_='entity-card-list')
    movies_cards.append(cards)
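
By the way, if you want to be a little gentler with the site, a variant of the same loop can check the HTTP status code and pause between requests. The one-second delay and the skip-on-error behaviour below are arbitrary choices on my part, just a sketch:

# optional variant of the collection loop, with a status check and a pause
import time

movies_cards = []
for page in range(pages):
    url = f'http://www.allocine.fr/dvd/meilleurs/?page={page}'
    r = requests.get(url)
    if r.status_code != 200:
        # skip pages that did not load correctly
        continue
    allocine = BeautifulSoup(r.text, 'lxml')
    movies_cards.append(allocine.find_all('div', class_='entity-card-list'))
    time.sleep(1)  # small pause to avoid hammering the server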

2. Process the data

# fill our movie lists
movie_title = []
movie_url = []
movie_acteurs = []
movie_resume = []
movie_presse = []
movie_spectateurs = []
movie_amis = []

for page_cards in movies_cards:
    for card in page_cards:
        # add the movie title
        movie_title.append(card.find('a', class_='meta-title-link', href=True).text)

        # add the movie URL
        movie_url.append(url_param + card.find('a', class_='meta-title-link', href=True)['href'])
        
        # add the movie actors
        if 'meta-body-actor' in str(card):
            actors_blocs = card.find('div', class_='meta-body-actor').text.splitlines()
            actors_list = []
            
            for list_elements in actors_blocs:
                m = list_elements.strip()
                if m != '' and m != 'Avec':
                    actors_list.append(m)
            movie_acteurs.append(' '.join(actors_list))
            
        else:
            movie_acteurs.append(fill_val)
        
        # add the movie summary
        if 'content-txt' in str(card):
            movie_resume.append(card.find('div', class_='content-txt').text.strip())
        else:
            movie_resume.append(fill_val)
        
        # add press, spectator and friends ratings; the rating blocks share the
        # same span/div class names, so we match each block by its label text
        if 'rating-item' in str(card):
            rating_items = card.find_all('div', class_='rating-item')
            presse = fill_val
            spectateurs = fill_val
            amis = fill_val
            for item in rating_items:
                note = item.find('span', class_='stareval-note').text
                if 'Presse' in str(item):
                    presse = note
                elif 'Spectateur' in str(item):
                    spectateurs = note
                elif 'Amis' in str(item):
                    amis = note
            movie_presse.append(presse)
            movie_spectateurs.append(spectateurs)
            movie_amis.append(amis)
        else:
            movie_presse.append(fill_val)
            movie_spectateurs.append(fill_val)
            movie_amis.append(fill_val)
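
Before building the DataFrame, it is worth checking that the seven lists stayed in sync (one entry per movie card), since a single missed append would shift every following row. A minimal sanity check could look like this (an extra step, not strictly needed):

# sanity check: every list should have one entry per movie card
lengths = {
    'titles': len(movie_title),
    'urls': len(movie_url),
    'actors': len(movie_acteurs),
    'resumes': len(movie_resume),
    'presse': len(movie_presse),
    'spectateurs': len(movie_spectateurs),
    'amis': len(movie_amis),
}
print(lengths)
assert len(set(lengths.values())) == 1, 'the lists are out of sync'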

3. Group the lists into a DataFrame

# create our dataframe
allocine_recent_movies = pd.DataFrame(
    {'Title': movie_title,
     'Press Rating' : movie_presse,
     'Spectator Rating' : movie_spectateurs,
     'Actors': movie_acteurs,
     'Resume' : movie_resume,
     'Allociné URL' : movie_url
    })

# display the full text of each column
pd.set_option('display.max_colwidth', None)  # use -1 instead of None on pandas versions before 1.0

# let's take a look
allocine_recent_movies.head()
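
One optional cleaning step: the ratings are scraped as text and, Allociné being a French site, use a comma as the decimal separator (for example '3,8'). If you want to sort or compute on them, a conversion along these lines should do (this is a suggestion, not something the exports below rely on):

# optional: convert the rating columns from text like '3,8' to floats
for col in ['Press Rating', 'Spectator Rating']:
    allocine_recent_movies[col] = (
        allocine_recent_movies[col]
        .str.replace(',', '.', regex=False)
        .replace('', np.nan)   # movies without a rating become NaN
        .astype(float)
    )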

4. Export the data

# dataframe to HTML
allocine_recent_movies.to_html('.../190814_allocine_recent_movies.html')

# dataframe to CSV
allocine_recent_movies.to_csv('.../190814_allocine_recent_movies.csv')
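
If the CSV is meant to be opened directly in Excel, a couple of extra arguments can help; the separator and encoding below are suggestions rather than what was used for the files above:

# optional variant: drop the index and use a separator/encoding that Excel handles well
allocine_recent_movies.to_csv(
    '.../190814_allocine_recent_movies.csv',
    index=False,
    sep=';',
    encoding='utf-8-sig',
)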

Here is the HTML file (scraping done on 14 August 2019). 374 films: nice job, BeautifulSoup!

- - -
I hope this post is useful to you. Feel free to share your own web scraping code in the comments!
- - -
