Scraping projects
This project is done with Python, Selenium and Scrapy, but not in the usual way: Selenium is only used for the first interaction and everything else (crawling and scraping) is done with Scrapy
site | https://www.local.ch/en |
fields | _name, _address, _phone, _mobile, _whatsapp, _email, _website |
static or dynamic | dynamic for the first step only: a browser is needed to get the URL generated by the search. beyond that, crawling and scraping are done by Scrapy on static data |
why? | we need the first interaction to be done with selenium (browser) |
crawler | dynamic horizontal and vertical crawler |
descrip. | after we get the URL generated by the search, everything is static on the client side (no need to render JavaScript), so Scrapy can crawl vertically and horizontally and scrape, which makes the scraper much faster (a sketch of the hand-off follows this entry) |
should be | python + selenium |
could be | selenium + scrapy |
has been | selenium + scrapy |
why? | because we just need the first interaction done with selenium (browser) and then everything is done with scrapy (crawling and scraping) |
interesting facts |
- we usually create scrapers either in selenium or in scrapy
- sometimes we can create a hybrid Selenium-Scrapy scraper where Selenium crawls and Scrapy scrapes
- in this case we need the first interaction in Selenium, but Selenium doesn't have to crawl horizontally or vertically: with the search URL we can construct the URLs and Scrapy does the job, which makes the scraper even faster because Selenium does the bare minimum |
output | csv |
sample |
output_local.ch.csv
output_local.ch.xlsx |
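A minimal sketch of the Selenium-to-Scrapy hand-off described above: Selenium only runs the search and returns the generated URL, then Scrapy crawls horizontally (pagination) and vertically (detail pages) from that URL. The form-field names, CSS selectors and search terms are illustrative assumptions, not the project's actual ones.

```python
# Hedged sketch: field names, selectors and search terms are assumptions.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import scrapy
from scrapy.crawler import CrawlerProcess


def get_search_url(keyword: str, location: str) -> str:
    """Selenium does the bare minimum: run the search and return the generated URL."""
    driver = webdriver.Chrome()
    try:
        driver.get("https://www.local.ch/en")
        driver.find_element(By.NAME, "what").send_keys(keyword)                 # assumed field name
        driver.find_element(By.NAME, "where").send_keys(location, Keys.ENTER)   # assumed field name
        return driver.current_url   # the URL generated by the search
    finally:
        driver.quit()


class LocalChSpider(scrapy.Spider):
    """Scrapy crawls horizontally (pagination) and vertically (detail pages)."""
    name = "local_ch"

    def __init__(self, start_url, **kwargs):
        super().__init__(**kwargs)
        self.start_urls = [start_url]

    def parse(self, response):
        # vertical: follow each listing to its detail page (assumed selector)
        for href in response.css("a.listing-link::attr(href)").getall():
            yield response.follow(href, self.parse_listing)
        # horizontal: follow the next results page (assumed selector)
        next_page = response.css("a[rel=next]::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def parse_listing(self, response):
        yield {
            "_name": response.css("h1::text").get(),                        # assumed selectors
            "_phone": response.css("a[href^='tel:']::attr(href)").get(),
            "_website": response.css("a.website::attr(href)").get(),
        }


if __name__ == "__main__":
    url = get_search_url("plumber", "Zurich")
    process = CrawlerProcess(settings={"FEEDS": {"output_local.ch.csv": {"format": "csv"}}})
    process.crawl(LocalChSpider, start_url=url)
    process.start()
```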
This project is done with Python and Selenium but needs a very special way of using Selenium to circumvent the Kasada protection
site | https://www.realestate.com.au/buy/property-house-in-riverwood,+nsw+2210/list-1?includeSurrounding=false&source=refinement |
fields | address, url, bedrooms, bathrooms, parking_space, land_size, auction_details |
static or dynamic | dynamic |
why? | JavaScript needs to be rendered |
crawler | no crawler here since we only scrape one page, just to show that we are able to circumvent the Kasada protection |
descrip. | the scraper connects to a single page and gets its data while circumventing the Kasada protection (a generic illustration follows this entry) |
should be | python + selenium |
could be | - |
has been | python + selenium |
why? | - |
interesting facts | - standard Selenium doesn't work here: the page simply won't load, so I had to find a very specific way to circumvent Kasada |
output | csv |
sample | output_realestate.csv |
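The project's exact Kasada workaround is not described here. Purely as a generic illustration of the idea of replacing stock Selenium with a less detectable driver (and not necessarily what this project used), a commonly cited option is undetected-chromedriver; the CSS selector below is also an assumption.

```python
# Generic illustration only: undetected-chromedriver is a common way to avoid
# the fingerprints that make stock Selenium detectable. It is NOT necessarily
# the technique this project used, and the CSS selector is an assumption.
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By

driver = uc.Chrome()   # patched ChromeDriver that hides the usual automation markers
try:
    driver.get(
        "https://www.realestate.com.au/buy/property-house-in-riverwood,+nsw+2210/"
        "list-1?includeSurrounding=false&source=refinement"
    )
    # With plain Selenium this page would not even load; here the listing cards
    # render and can be read, e.g. the addresses (assumed selector):
    addresses = [el.text for el in driver.find_elements(By.CSS_SELECTOR, ".residential-card__address-heading")]
    print(addresses)
finally:
    driver.quit()
```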
This project is done with Visual Basic.NET and Selenium. It's a visual (GUI) application that runs on the Windows operating system
site |
https://www.google.com/maps/search/
each search is made as a mix of city, zipcode and county, for example "Mezcal Spirit Los Angeles" or "Agave Spirit Los Angeles" |
fields | _keyword, _name, _full_address, _street_address, _city, _state, _zipcode, _country, _longitude, _latitude, _licence, _description, _services, _website, _phone, _wheelchair_accessible_entrance, _google_maps_url, _hours_of_operation, _monday_from, _monday_to, _tuesday_from, _tuesday_to, _wednesday_from, _wednesday_to, _thursday_from, _thursday_to, _friday_from, _friday_to, _saturday_from, _saturday_to, _sunday_from, _sunday_to |
static or dynamic | dynamic |
why? | Google Maps simply will not work without JavaScript, plus we must reach the deepest zoom level (street level) |
crawler | the crawler is both static and dynamic: static in the sense that the scraper uses a static list of cities, counties and zipcodes, and dynamic in the sense that, when we make a search, we crawl the results as an infinitely scrolling list and open a popup for each result |
descrip. | the crawler reads a static list of URLs, creates the searches, gets the results for each search (scrolling the infinite list until reaching its end) and opens a popup for each result (a sketch of this loop follows this entry) |
should be | visual basic.net + selenium |
could be | - |
has been | visual basic.net + selenium |
why? | - |
interesting facts |
- the project is done with Visual Basic.NET instead of the usual Python, and it works with an MS Access database instead of MySQL or PostgreSQL. the technologies used here are simply different, but we are still using Selenium
- the crawler is NOT static or dynamic: it is both, static on one hand (making the search itself) and dynamic on the other (the crawling itself, scrolling the infinite results list)
- the scraper works at the deepest Google Maps zoom level (street level)
- I can create an installer (setup.exe) to install it on any computer running Windows (the Access database is part of the install, but to open it you must use Microsoft Office or an open-source solution such as OpenOffice or LibreOffice)
- the scraper is optimized to avoid repetitions: when a result has already been treated, its popup won't be opened again, which makes the scraper faster
- the crawler doesn't paginate, it deals with an infinitely scrolling list: it's a kind of vertical crawling instead of horizontal crawling (pagination) |
output | mdb, csv |
sample |
google_maps_liquors.csv
google_maps_liquors.mdb |
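The project itself is written in Visual Basic.NET; the Python/Selenium sketch below only mirrors the crawl loop described above (infinite scroll of the results list, one popup per result, de-duplication of already-treated results). The selectors, waits and search URL are assumptions.

```python
# The project is written in VB.NET; this Python/Selenium version only mirrors
# the crawl loop (infinite scroll + de-duplication). Selectors, waits and the
# search URL are assumptions; stale-element handling is omitted.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.google.com/maps/search/Mezcal+Spirit+Los+Angeles")
time.sleep(5)

seen = set()   # results already treated: their popup is never reopened
feed = driver.find_element(By.CSS_SELECTOR, "div[role='feed']")   # assumed results container

while True:
    cards = feed.find_elements(By.CSS_SELECTOR, "a[href*='/maps/place/']")
    new_cards = [c for c in cards if c.get_attribute("href") not in seen]
    if not new_cards:
        break   # nothing new appeared after the last scroll: end of the list
    for card in new_cards:
        seen.add(card.get_attribute("href"))
        card.click()   # opens the detail popup for this result
        time.sleep(2)
        # ... read _name, _full_address, _phone, hours of operation, etc. here ...
    # vertical crawling: scroll the results list instead of paginating
    driver.execute_script("arguments[0].scrollTop = arguments[0].scrollHeight", feed)
    time.sleep(2)

driver.quit()
```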
This project is done with Python and Scrapy, deals with an unusual pagination and creates a folder structure to download product images
site | https://www.mezcalreviews.com/filter-by/brand/ |
fields | product_to_scrap_url, brand_url, brand_name, brand_image_url, brand_directory, product_name, product_image_url, product_new_image_name, product_decription, category, cost, brand, mezcalero, maguey, agave, grind, fermentation, milling, distillation, style, state, town, ABV, website |
static or dynamic | static |
why? | - |
crawler | the crawler is static, but there is no next button (with or without an internal URL), therefore we must find an internal way to paginate each brand to get access to all of its products |
descrip. | the scraper paginates by constructing internal static URLs (a sketch follows this entry) |
should be | python + scrapy |
could be |
python + requests + beautiful soup
python + requests + lxml |
has been | python + scrapy |
why? | - |
interesting facts |
- the site is static: we can work with python + requests + (beautiful soup or lxml) or even better: scrapy.
- there is no first, previous, next or last button. we could have paginated through all pages of all brands without going through the brands, but in that case the pagination goes by interval. working brand by brand is better because we can structure the brand image and the product images when downloading them
- the scraper downloads brand and product images and structures them in directories named after each brand. the brand image is always ordered in first position (its name begins with _) |
output | csv |
sample |
output_mezcalreviews.csv
images_mezcalreviews.rar
images_mezcalreviews.zip |
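A minimal Scrapy sketch of the brand-by-brand pagination described above. The real internal URL pattern and selectors are not given in this write-up, so the /page/<n>/ pattern and CSS selectors below are placeholders for illustration.

```python
# Hedged sketch of the "paginate by constructing internal static URLs" idea.
# The /page/<n>/ pattern and the CSS selectors are assumptions.
import scrapy


class MezcalBrandsSpider(scrapy.Spider):
    name = "mezcalreviews"
    start_urls = ["https://www.mezcalreviews.com/filter-by/brand/"]

    def parse(self, response):
        # horizontal: one request per brand
        for brand_url in response.css("a.brand::attr(href)").getall():   # assumed selector
            yield response.follow(brand_url, self.parse_brand, cb_kwargs={"page": 1})

    def parse_brand(self, response, page):
        products = response.css("a.product::attr(href)").getall()        # assumed selector
        for product_url in products:
            yield response.follow(product_url, self.parse_product)
        # no first/previous/next/last button: construct the next page URL ourselves
        if products:
            next_url = response.url.split("/page/")[0].rstrip("/") + f"/page/{page + 1}/"
            yield response.follow(next_url, self.parse_brand, cb_kwargs={"page": page + 1})

    def parse_product(self, response):
        brand = response.css(".brand-name::text").get()                   # assumed selectors
        yield {
            "brand_name": brand,
            "product_name": response.css("h1::text").get(),
            "product_image_url": response.css("img.product::attr(src)").get(),
            # images are later saved under images/<brand_name>/, with the brand
            # image renamed to start with "_" so it sorts first
            "brand_directory": f"images/{brand}",
        }
```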
This project is done with Python and Selenium, and finds a way to avoid having to crawl from page 1 every time we get a record
site | https://ilv.ifa.dguv.de/substances |
fields | substance_name, cas_no, remark, country, twa_ppm, twa_mg_m3, twa_f_cm3, stel_ppm, stel_mg_m3, stel_f_cm3 |
static or dynamic | dynamic |
why? | because we must click on a button to open all entries and then click on each link |
crawler | dynamic |
descrip. |
when clicking to the next page, we can access an XML file with a list of IDs that is sufficient to construct the URLs we must scrape. we identify the number of pages dynamically, based on 48 results per page.
this lets us avoid paginating entirely: whenever we click on an entry and go back to the list of substances, we always land back on page 1, so getting the data of an entry on a given page would otherwise mean paginating as many times as needed to reach that page, over and over (a sketch of the ID shortcut follows this entry) |
should be | python + selenium |
could be | - |
has been | python + selenium |
why? | - |
interesting facts |
- the fact that getting an entry's data on a given page means paginating back to that page implies that to get 48 entries on each of 48 pages, I would have to make 48 x (0 + 1 + ... + 47) = 48 x 1128 = 54,144 page requests, which is pretty inefficient; that is what the IDs in the XML file resolve.
- the rows go 3 by 3: the first one holds a value, the second one must be clicked to make the third one appear and expose its value. it looks static, but that third value is dynamic. |
output | csv, xlsx |
sample |
output_substances.csv
output_substances.xlsx |
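A hedged sketch of the ID-list shortcut described above: harvest the substance IDs from the XML that the "next page" request exposes, then visit each detail URL directly instead of paginating back from page 1. The XML endpoint, its tag names and the detail-URL pattern are hypothetical placeholders.

```python
# Hedged sketch of the ID-list shortcut. The real XML endpoint, its tag names
# and the detail-URL pattern are not given in this write-up, so the ones below
# are placeholders for illustration only.
import math
import requests
import xml.etree.ElementTree as ET
from selenium import webdriver

RESULTS_PER_PAGE = 48
XML_URL = "https://ilv.ifa.dguv.de/substances/page/{page}.xml"    # hypothetical endpoint
DETAIL_URL = "https://ilv.ifa.dguv.de/substances/{substance_id}"  # hypothetical pattern


def collect_ids(total_results: int) -> list[str]:
    """Read the XML behind each 'next page' request and harvest the substance IDs."""
    pages = math.ceil(total_results / RESULTS_PER_PAGE)   # number of pages, derived dynamically
    ids = []
    for page in range(1, pages + 1):
        tree = ET.fromstring(requests.get(XML_URL.format(page=page), timeout=30).content)
        ids += [node.text for node in tree.iter("id")]    # assumed tag name
    return ids


def scrape_details(ids):
    """Visit each substance URL directly: no pagination, no going back to page 1."""
    driver = webdriver.Chrome()
    try:
        for substance_id in ids:
            driver.get(DETAIL_URL.format(substance_id=substance_id))
            # ... click the second row of each 3-row group so the third (dynamic)
            #     row appears, then read the twa/stel values ...
    finally:
        driver.quit()
```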
This project is done with Python and Scrapy instead of Python and Selenium, using a static crawler (a list of URLs)
site | https://www.bexio.com/en-CH/fiduciary-directory |
fields | url, name, address, phone, website |
page dynamicity | dynamic |
why? | because we must click on a button to open all entries and click on each link |
crawler | static |
descrip. | I just get a list of URLs from my Chrome extension, put the URLs in a list and read that list statically (a sketch follows this entry) |
should be | python + selenium |
could be | python + selenium/scrapy |
has been | python + scrapy |
why? | because instead of opening each entry dynamically (popup) after clicking on the button to access all entries, I noticed that each entry's URL is already available in the HTML code, which means a static list of URLs is enough and we can use Scrapy |
interesting facts | - the site is theoretically protected by Google reCAPTCHA, which could be triggered if we interacted with the site using an automation tool such as Selenium; using Scrapy avoids that |
output | csv, xlsx |
sample |
output_beixio.csv
output_beixio.xlsx |
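A minimal sketch of the static-list approach: the URLs exported from the Chrome extension are read from a plain text file and each one is requested directly, with no browser interaction at all. The file name and CSS selectors are assumptions.

```python
# Static-list sketch: the file name and the CSS selectors are assumptions.
import scrapy


class BexioFiduciarySpider(scrapy.Spider):
    name = "bexio_fiduciaries"

    def start_requests(self):
        # the URL list exported from the Chrome extension, one URL per line
        with open("bexio_urls.txt", encoding="utf-8") as f:
            for url in (line.strip() for line in f if line.strip()):
                yield scrapy.Request(url, callback=self.parse_entry)

    def parse_entry(self, response):
        yield {
            "url": response.url,
            "name": response.css("h1::text").get(),                           # assumed selectors
            "address": " ".join(response.css(".address ::text").getall()).strip(),
            "phone": response.css("a[href^='tel:']::attr(href)").get(),
            "website": response.css("a[rel=nofollow]::attr(href)").get(),
        }
```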
This project is done with Python and Scrapy based on a static crawler that generates 2 outputs
site | https://www.cost.eu/cost-actions-event/browse-actions/ |
fields |
output_chair_details.xlsx: _action_url, _action_code, _action_description, _chair_name, _chair_phone_number, _chair_email
output_member_details.xlsx: _action_url, _action_code, _action_description, _name, _person_details, _phone_number, _email, _country, _participating_actions |
page dynamicity | static |
why? | it's a static page; each tab creates a new URL, so we can construct the URLs ourselves (simple GET requests) |
crawler | static |
descrip. | static list of urls |
should be | python + scrapy |
could be | - |
has been | python + scrapy |
why? | - |
interesting facts |
- instead of creating an XML, JSON or CSV output, we create a .xlsx output
- to create the Excel output, I don't use a Scrapy pipeline (I could have); I use pandas (a sketch follows this entry)
- there are 2 outputs, 1 output per tab (2 tabs)
- we must use an IP address from the United States; I used a simple VPN (ExpressVPN) |
output | xlsx |
sample |
output_chair_details.xlsx
output_member_details.xlsx |
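A sketch of the pandas-based Excel export mentioned above (instead of a Scrapy pipeline or feed export): items are accumulated on the spider and written to two .xlsx files, one per tab, when the spider closes. The selectors and field extraction are assumptions; writing .xlsx with pandas requires openpyxl.

```python
# Sketch of the pandas Excel export (instead of a Scrapy pipeline or feed
# export). The selectors and the member-tab handling are assumptions; the
# signals wiring follows the standard Scrapy pattern.
import pandas as pd
import scrapy
from scrapy import signals
from scrapy.crawler import CrawlerProcess


class CostActionsSpider(scrapy.Spider):
    name = "cost_actions"
    start_urls = ["https://www.cost.eu/cost-actions-event/browse-actions/"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.chair_rows = []
        self.member_rows = []

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.export_excel, signal=signals.spider_closed)
        return spider

    def parse(self, response):
        # each tab/action is a plain URL, so everything is simple GET requests
        for action_url in response.css("a.action::attr(href)").getall():   # assumed selector
            yield response.follow(action_url, self.parse_action)

    def parse_action(self, response):
        self.chair_rows.append({
            "_action_url": response.url,
            "_action_code": response.css(".action-code::text").get(),      # assumed selectors
            "_chair_name": response.css(".chair-name::text").get(),
        })
        # ... member rows are collected the same way from the members tab ...

    def export_excel(self, spider):
        # two .xlsx outputs, one per tab, written with pandas rather than a pipeline
        pd.DataFrame(self.chair_rows).to_excel("output_chair_details.xlsx", index=False)
        pd.DataFrame(self.member_rows).to_excel("output_member_details.xlsx", index=False)


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(CostActionsSpider)
    process.start()
```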