Scraping projects
This project is done with Python, Selenium and Scrapy, but not in the usual way: Selenium is only used for the first interaction and everything else (crawling and scraping) is done with Scrapy
site | https://www.local.ch/en |
fields | _name, _address, _phone, _mobile, _whatsapp, _email, _website |
static or dynamic | dynamic for the first step only: a browser is needed to get the URL generated by the search. beyond that, crawling and scraping are done by Scrapy on static data |
why? | we need the first interaction to be done with selenium (browser) |
crawler | dynamic horizontal and vertical crawler |
descrip. | after we get the URL generated by the search, everything is static on the client side (no need to render JavaScript), so Scrapy can crawl vertically and horizontally and scrape, which makes the scraper much faster (a sketch of the hand-off follows this entry) |
should be | python + selenium |
could be | selenium + scrapy |
has been | selenium + scrapy |
why? | because we just need the first interaction done with selenium (browser) and then everything is done with scrapy (crawling and scraping) |
interesting facts |
- we usually create scrapers either in selenium or in scrapy
- sometimes we can create a hybrid Selenium-Scrapy scraper where Selenium crawls and Scrapy scrapes
- in this case we need the first interaction in Selenium, but Selenium doesn't have to crawl horizontally or vertically: with the search URL we can construct the URLs and Scrapy does the job, which makes the scraper even faster because Selenium does the bare minimum |
output | csv |
sample |
output_local.ch.csv
output_local.ch.xlsx |
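A minimal sketch of the Selenium-to-Scrapy hand-off described above: Selenium only runs the search and returns the generated URL, then Scrapy crawls horizontally (pagination) and vertically (detail pages) from that URL. The form-field names, CSS selectors and search terms are illustrative assumptions, not the project's actual ones.

```python
# Hedged sketch: field names, selectors and search terms are assumptions.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import scrapy
from scrapy.crawler import CrawlerProcess


def get_search_url(keyword: str, location: str) -> str:
    """Selenium does the bare minimum: run the search and return the generated URL."""
    driver = webdriver.Chrome()
    try:
        driver.get("https://www.local.ch/en")
        driver.find_element(By.NAME, "what").send_keys(keyword)                 # assumed field name
        driver.find_element(By.NAME, "where").send_keys(location, Keys.ENTER)   # assumed field name
        return driver.current_url   # the URL generated by the search
    finally:
        driver.quit()


class LocalChSpider(scrapy.Spider):
    """Scrapy crawls horizontally (pagination) and vertically (detail pages)."""
    name = "local_ch"

    def __init__(self, start_url, **kwargs):
        super().__init__(**kwargs)
        self.start_urls = [start_url]

    def parse(self, response):
        # vertical: follow each listing to its detail page (assumed selector)
        for href in response.css("a.listing-link::attr(href)").getall():
            yield response.follow(href, self.parse_listing)
        # horizontal: follow the next results page (assumed selector)
        next_page = response.css("a[rel=next]::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def parse_listing(self, response):
        yield {
            "_name": response.css("h1::text").get(),                        # assumed selectors
            "_phone": response.css("a[href^='tel:']::attr(href)").get(),
            "_website": response.css("a.website::attr(href)").get(),
        }


if __name__ == "__main__":
    url = get_search_url("plumber", "Zurich")
    process = CrawlerProcess(settings={"FEEDS": {"output_local.ch.csv": {"format": "csv"}}})
    process.crawl(LocalChSpider, start_url=url)
    process.start()
```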
This project is done with Python and Selenium but needs a very special way of using Selenium to circumvent the Kasada protection
site | https://www.realestate.com.au/buy/property-house-in-riverwood,+nsw+2210/list-1?includeSurrounding=false&source=refinement |
fields | address, url, bedrooms, bathrooms, parking_space, land_size, auction_details |
static or dynamic | dynamic |
why? | JavaScript needs to be rendered |
crawler | no crawler here since we only scrape one page, just to show that we are able to circumvent the Kasada protection |
descrip. | the scraper connects to a single page and gets its data while circumventing the Kasada protection (a generic illustration follows this entry) |
should be | python + selenium |
could be | - |
has been | python + selenium |
why? | - |
interesting facts | - standard Selenium doesn't work here: the page simply won't load, so I had to find a very specific way to circumvent Kasada |
output | csv |
sample | output_realestate.csv |
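The project's exact Kasada workaround is not described here. Purely as a generic illustration of the idea of replacing stock Selenium with a less detectable driver (and not necessarily what this project used), a commonly cited option is undetected-chromedriver; the CSS selector below is also an assumption.

```python
# Generic illustration only: undetected-chromedriver is a common way to avoid
# the fingerprints that make stock Selenium detectable. It is NOT necessarily
# the technique this project used, and the CSS selector is an assumption.
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By

driver = uc.Chrome()   # patched ChromeDriver that hides the usual automation markers
try:
    driver.get(
        "https://www.realestate.com.au/buy/property-house-in-riverwood,+nsw+2210/"
        "list-1?includeSurrounding=false&source=refinement"
    )
    # With plain Selenium this page would not even load; here the listing cards
    # render and can be read, e.g. the addresses (assumed selector):
    addresses = [el.text for el in driver.find_elements(By.CSS_SELECTOR, ".residential-card__address-heading")]
    print(addresses)
finally:
    driver.quit()
```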
This project is done with Visual Basic.NET and Selenium. It's a visual (GUI) application that runs on the Windows operating system
site |
https://www.google.com/maps/search/
each search is made as a mix of city, zipcode and county, for example "Mezcal Spirit Los Angeles" or "Agave Spirit Los Angeles" |
fields | _keyword, _name, _full_address, _street_address, _city, _state, _zipcode, _country, _longitude, _latitude, _licence, _description, _services, _website, _phone, _wheelchair_accessible_entrance, _google_maps_url, _hours_of_operation, _monday_from, _monday_to, _tuesday_from, _tuesday_to, _wednesday_from, _wednesday_to, _thursday_from, _thursday_to, _friday_from, _friday_to, _saturday_from, _saturday_to, _sunday_from, _sunday_to |
static or dynamic | dynamic |
why? | Google Maps simply will not work without JavaScript, plus we must reach the deepest zoom level (street level) |
crawler | the crawler is both static and dynamic: static in the sense that the scraper uses a static list of cities, counties and zipcodes, and dynamic in the sense that, when we make a search, we crawl the results as an infinitely scrolling list and open a popup for each result |
descrip. | the crawler reads a static list of URLs, creates the searches, gets the results for each search (scrolling the infinite list until reaching its end) and opens a popup for each result (a sketch of this loop follows this entry) |
should be | visual basic.net + selenium |
could be | - |
has been | visual basic.net + selenium |
why? | - |
interesting facts |
- the project is done with Visual Basic.NET instead of the usual Python, and it works with an MS Access database instead of MySQL or PostgreSQL. the technologies used here are simply different, but we are still using Selenium
- the crawler is NOT static or dynamic: it is both, static on one hand (making the search itself) and dynamic on the other (the crawling itself, scrolling the infinite results list)
- the scraper works at the deepest Google Maps zoom level (street level)
- I can create an installer (setup.exe) to install it on any computer running Windows (the Access database is part of the install, but to open it you must use Microsoft Office or an open-source solution such as OpenOffice or LibreOffice)
- the scraper is optimized to avoid repetitions: when a result has already been treated, its popup won't be opened again, which makes the scraper faster
- the crawler doesn't paginate, it deals with an infinitely scrolling list: it's a kind of vertical crawling instead of horizontal crawling (pagination) |
output | mdb, csv |
sample |
google_maps_liquors.csv
google_maps_liquors.mdb |
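The project itself is written in Visual Basic.NET; the Python/Selenium sketch below only mirrors the crawl loop described above (infinite scroll of the results list, one popup per result, de-duplication of already-treated results). The selectors, waits and search URL are assumptions.

```python
# The project is written in VB.NET; this Python/Selenium version only mirrors
# the crawl loop (infinite scroll + de-duplication). Selectors, waits and the
# search URL are assumptions; stale-element handling is omitted.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.google.com/maps/search/Mezcal+Spirit+Los+Angeles")
time.sleep(5)

seen = set()   # results already treated: their popup is never reopened
feed = driver.find_element(By.CSS_SELECTOR, "div[role='feed']")   # assumed results container

while True:
    cards = feed.find_elements(By.CSS_SELECTOR, "a[href*='/maps/place/']")
    new_cards = [c for c in cards if c.get_attribute("href") not in seen]
    if not new_cards:
        break   # nothing new appeared after the last scroll: end of the list
    for card in new_cards:
        seen.add(card.get_attribute("href"))
        card.click()   # opens the detail popup for this result
        time.sleep(2)
        # ... read _name, _full_address, _phone, hours of operation, etc. here ...
    # vertical crawling: scroll the results list instead of paginating
    driver.execute_script("arguments[0].scrollTop = arguments[0].scrollHeight", feed)
    time.sleep(2)

driver.quit()
```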
This project is done with Python and Scrapy, deals with an unusual pagination and creates a folder structure to download product images
site | https://www.mezcalreviews.com/filter-by/brand/ |
fields | product_to_scrap_url, brand_url, brand_name, brand_image_url, brand_directory, product_name, product_image_url, product_new_image_name, product_decription, category, cost, brand, mezcalero, maguey, agave, grind, fermentation, milling, distillation, style, state, town, ABV, website |
static or dynamic | static |
why? | - |
crawler | the crawler is static, but there is no next button (with or without an internal URL), therefore we must find an internal way to paginate each brand to get access to all of its products |
descrip. | the scraper paginates by constructing internal static URLs (a sketch follows this entry) |
should be | python + scrapy |
could be |
python + requests + beautiful soup
python + requests + lxml |
has been | python + scrapy |
why? | - |
interesting facts |
- the site is static: we can work with python + requests + (beautiful soup or lxml) or even better: scrapy.
- there is no first, previous, next or last button. we could have paginated through all pages of all brands without going through the brands, but in that case the pagination goes by interval. working brand by brand is better because we can structure the brand image and the product images when downloading them
- the scraper downloads brand and product images and structures them in directories named after each brand. the brand image is always ordered in first position (its name begins with _) |
output | csv |
sample |
output_mezcalreviews.csv
images_mezcalreviews.rar
images_mezcalreviews.zip |
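A minimal Scrapy sketch of the brand-by-brand pagination described above. The real internal URL pattern and selectors are not given in this write-up, so the /page/<n>/ pattern and CSS selectors below are placeholders for illustration.

```python
# Hedged sketch of the "paginate by constructing internal static URLs" idea.
# The /page/<n>/ pattern and the CSS selectors are assumptions.
import scrapy


class MezcalBrandsSpider(scrapy.Spider):
    name = "mezcalreviews"
    start_urls = ["https://www.mezcalreviews.com/filter-by/brand/"]

    def parse(self, response):
        # horizontal: one request per brand
        for brand_url in response.css("a.brand::attr(href)").getall():   # assumed selector
            yield response.follow(brand_url, self.parse_brand, cb_kwargs={"page": 1})

    def parse_brand(self, response, page):
        products = response.css("a.product::attr(href)").getall()        # assumed selector
        for product_url in products:
            yield response.follow(product_url, self.parse_product)
        # no first/previous/next/last button: construct the next page URL ourselves
        if products:
            next_url = response.url.split("/page/")[0].rstrip("/") + f"/page/{page + 1}/"
            yield response.follow(next_url, self.parse_brand, cb_kwargs={"page": page + 1})

    def parse_product(self, response):
        brand = response.css(".brand-name::text").get()                   # assumed selectors
        yield {
            "brand_name": brand,
            "product_name": response.css("h1::text").get(),
            "product_image_url": response.css("img.product::attr(src)").get(),
            # images are later saved under images/<brand_name>/, with the brand
            # image renamed to start with "_" so it sorts first
            "brand_directory": f"images/{brand}",
        }
```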
This project is done with Python and Selenium, and finds a way to avoid having to crawl from page 1 every time we get a record
site | https://ilv.ifa.dguv.de/substances |
fields | substance_name, cas_no, remark, country, twa_ppm, twa_mg_m3, twa_f_cm3, stel_ppm, stel_mg_m3, stel_f_cm3 |
static or dynamic | dynamic |
why? | because we must click on a button to open all entries and then click on each link |
crawler | dynamic |
descrip. |
when clicking to the next page, we can access an XML file with a list of IDs that is sufficient to construct the URLs we must scrape. we identify the number of pages dynamically, based on 48 results per page.
this lets us avoid paginating entirely: whenever we click on an entry and go back to the list of substances, we always land back on page 1, so getting the data of an entry on a given page would otherwise mean paginating as many times as needed to reach that page, over and over (a sketch of the ID shortcut follows this entry) |
should be | python + selenium |
could be | - |
has been | python + selenium |
why? | - |
interesting facts |
- the fact that getting an entry's data on a given page means paginating back to that page implies that to get 48 entries on each of 48 pages, I would have to make 48 x (0 + 1 + ... + 47) = 48 x 1128 = 54,144 page requests, which is pretty inefficient; that is what the IDs in the XML file resolve.
- the rows go 3 by 3: the first one holds a value, the second one must be clicked to make the third one appear and expose its value. it looks static, but that third value is dynamic. |
output | csv, xlsx |
sample |
output_substances.csv
output_substances.xlsx |
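A hedged sketch of the ID-list shortcut described above: harvest the substance IDs from the XML that the "next page" request exposes, then visit each detail URL directly instead of paginating back from page 1. The XML endpoint, its tag names and the detail-URL pattern are hypothetical placeholders.

```python
# Hedged sketch of the ID-list shortcut. The real XML endpoint, its tag names
# and the detail-URL pattern are not given in this write-up, so the ones below
# are placeholders for illustration only.
import math
import requests
import xml.etree.ElementTree as ET
from selenium import webdriver

RESULTS_PER_PAGE = 48
XML_URL = "https://ilv.ifa.dguv.de/substances/page/{page}.xml"    # hypothetical endpoint
DETAIL_URL = "https://ilv.ifa.dguv.de/substances/{substance_id}"  # hypothetical pattern


def collect_ids(total_results: int) -> list[str]:
    """Read the XML behind each 'next page' request and harvest the substance IDs."""
    pages = math.ceil(total_results / RESULTS_PER_PAGE)   # number of pages, derived dynamically
    ids = []
    for page in range(1, pages + 1):
        tree = ET.fromstring(requests.get(XML_URL.format(page=page), timeout=30).content)
        ids += [node.text for node in tree.iter("id")]    # assumed tag name
    return ids


def scrape_details(ids):
    """Visit each substance URL directly: no pagination, no going back to page 1."""
    driver = webdriver.Chrome()
    try:
        for substance_id in ids:
            driver.get(DETAIL_URL.format(substance_id=substance_id))
            # ... click the second row of each 3-row group so the third (dynamic)
            #     row appears, then read the twa/stel values ...
    finally:
        driver.quit()
```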
This project is done with Python and Scrapy instead of Python and Selenium, using a static crawler (a list of URLs)
site | https://www.bexio.com/en-CH/fiduciary-directory |
fields | url, name, address, phone, website |
page dynamicity | dynamic |
why? | because we must click on a button to open all entries and click on each link |
crawler | static |
descrip. | I just get a list of URLs from my Chrome extension, put the URLs in a list and read that list statically (a sketch follows this entry) |
should be | python + selenium |
could be | python + selenium/scrapy |
has been | python + scrapy |
why? | because instead of opening each entry dynamically (popup) after clicking on the button to access all entries, I noticed that each entry's URL is already available in the HTML code, which means a static list of URLs is enough and we can use Scrapy |
interesting facts | - the site is theoretically protected by Google reCAPTCHA, which could be triggered if we interacted with the site using an automation tool such as Selenium; using Scrapy avoids that |
output | csv, xlsx |
sample |
output_beixio.csv
output_beixio.xlsx |
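A minimal sketch of the static-list approach: the URLs exported from the Chrome extension are read from a plain text file and each one is requested directly, with no browser interaction at all. The file name and CSS selectors are assumptions.

```python
# Static-list sketch: the file name and the CSS selectors are assumptions.
import scrapy


class BexioFiduciarySpider(scrapy.Spider):
    name = "bexio_fiduciaries"

    def start_requests(self):
        # the URL list exported from the Chrome extension, one URL per line
        with open("bexio_urls.txt", encoding="utf-8") as f:
            for url in (line.strip() for line in f if line.strip()):
                yield scrapy.Request(url, callback=self.parse_entry)

    def parse_entry(self, response):
        yield {
            "url": response.url,
            "name": response.css("h1::text").get(),                           # assumed selectors
            "address": " ".join(response.css(".address ::text").getall()).strip(),
            "phone": response.css("a[href^='tel:']::attr(href)").get(),
            "website": response.css("a[rel=nofollow]::attr(href)").get(),
        }
```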
This project is done with Python and Scrapy based on a static crawler that generates 2 outputs
site | https://www.cost.eu/cost-actions-event/browse-actions/ |
fields |
output_chair_details.xlsx: _action_url, _action_code, _action_description, _chair_name, _chair_phone_number, _chair_email
output_member_details.xlsx: _action_url, _action_code, _action_description, _name, _person_details, _phone_number, _email, _country, _participating_actions |
page dynamicity | static |
why? | it's a static page; each tab creates a new URL, so we can construct the URLs ourselves (simple GET requests) |
crawler | static |
descrip. | static list of urls |
should be | python + scrapy |
could be | - |
has been | python + scrapy |
why? | - |
interesting facts |
- instead of creating an XML, JSON or CSV output, we create a .xlsx output
- to create the Excel output, I don't use a Scrapy pipeline (I could have); I use pandas (a sketch follows this entry)
- there are 2 outputs, 1 output per tab (2 tabs)
- we must use an IP address from the United States; I used a simple VPN (ExpressVPN) |
output | xlsx |
sample |
output_chair_details.xlsx
output_member_details.xlsx |
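A sketch of the pandas-based Excel export mentioned above (instead of a Scrapy pipeline or feed export): items are accumulated on the spider and written to two .xlsx files, one per tab, when the spider closes. The selectors and field extraction are assumptions; writing .xlsx with pandas requires openpyxl.

```python
# Sketch of the pandas Excel export (instead of a Scrapy pipeline or feed
# export). The selectors and the member-tab handling are assumptions; the
# signals wiring follows the standard Scrapy pattern.
import pandas as pd
import scrapy
from scrapy import signals
from scrapy.crawler import CrawlerProcess


class CostActionsSpider(scrapy.Spider):
    name = "cost_actions"
    start_urls = ["https://www.cost.eu/cost-actions-event/browse-actions/"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.chair_rows = []
        self.member_rows = []

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.export_excel, signal=signals.spider_closed)
        return spider

    def parse(self, response):
        # each tab/action is a plain URL, so everything is simple GET requests
        for action_url in response.css("a.action::attr(href)").getall():   # assumed selector
            yield response.follow(action_url, self.parse_action)

    def parse_action(self, response):
        self.chair_rows.append({
            "_action_url": response.url,
            "_action_code": response.css(".action-code::text").get(),      # assumed selectors
            "_chair_name": response.css(".chair-name::text").get(),
        })
        # ... member rows are collected the same way from the members tab ...

    def export_excel(self, spider):
        # two .xlsx outputs, one per tab, written with pandas rather than a pipeline
        pd.DataFrame(self.chair_rows).to_excel("output_chair_details.xlsx", index=False)
        pd.DataFrame(self.member_rows).to_excel("output_member_details.xlsx", index=False)


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(CostActionsSpider)
    process.start()
```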