Mission - a quick way to surface new job listings for particular search
parameters in a minimalist format, and to learn something along the way
I was interested in the area of web parsing and the idea of quickly gathering
information from a webpage. My idea was to parse a job search website with the
parameters I was interested in and have it find new job offers for me in an
effective manner. A little birdie told me that "the modern way to do it is called
MechanicalSoup, Selenium is for oldies" - let's see.
MechanicalSoup with URL parameters approach
First look at the library: MechanicalSoup. It is built on the requests and
BeautifulSoup libraries and can follow links and submit forms. It sounds like
an ideal lightweight alternative to Selenium when you only want basic
interaction and flexible information scraping.
I defined the URL with parameters, parsed the listed jobs from the first page,
separated the information and output only the important bits (name, URL,
location, remote possibility).
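In code, that first step looked roughly like this (a minimal sketch; the URL and the CSS selector are hypothetical stand-ins for the real site):

import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
# hypothetical search URL with the query encoded as URL parameters
browser.open("https://example-jobs.com/search?keyword=python&location=zurich")

# browser.page is the BeautifulSoup tree of the current page
for job in browser.page.select("article.job-listing"):
    link = job.find("a", href=True)
    print(link.get_text(strip=True), link["href"])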
Everything looked fine, the results of the first page were listed,
MechanicalSoup parsed the elements cleanly. Then something didn't add up - the
results of the second page seemed too unrelated to my search word. I assumed
that the browser object might have difficulty holding the cookies on this page
and just 'forgets' everything when it clicks 'Next'.
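The pagination step itself was trivial - something along these lines (a sketch; the form selector is hypothetical):

# re-select the pagination form on the current page and "click" Next
browser.select_form('form.pagination')
browser.submit_selected()
# the session should carry the cookies along, but here page 2
# no longer matched the original search term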
I originally thought that I might fix it by adding some parameters to
.submit_selected(), or by tinkering with cookie options, but after exhaustive
DuckDuckGo searches and a willingness to study the direct and indirect
documentation, I decided to switch to another approach to at least bring the
use case to a successful end.
MechanicalSoup is not yet ready to replace Selenium in all aspects of web
parsing (at least for some webpages). At least so I thought.
Code is here.
Selenium with BeautifulSoup approach
Selenium is the person you call when you want the job done reliably. At least
that is mostly the case. I quickly fired up the Chrome browser (and discovered
that there is a new way of handling the driver - the driver file is no longer
needed, it is handled by a single command, ChromeDriverManager().install()
from webdriver-manager), set the parameters, navigated to the search page, and
selected all the necessary locators to fire up the search. Everything went
pretty smoothly: the script navigated through the results, gathered all the
job data with BeautifulSoup, and listed the jobs.
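The setup boils down to a few lines (a minimal sketch; the URL and locators are hypothetical, and this uses the Selenium 4 Service API):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup

# webdriver-manager downloads a matching chromedriver automatically
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://example-jobs.com")  # hypothetical search page

search_box = driver.find_element(By.NAME, "keyword")  # hypothetical locator
search_box.send_keys("python")
search_box.submit()

# hand the rendered page over to BeautifulSoup for parsing
soup = BeautifulSoup(driver.page_source, "html.parser")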
I also added a way for the script to select only the newest jobs (those it
didn't find previously) - storing the search results per search term in a
pandas DataFrame and each time listing only the difference from the new search
(afterwards updating the main DataFrame):
job_details = [text for text in soup_job.stripped_strings]
a_tags = soup_job.find_all('a', href=True)

# compare the current results against the previous run
for job in all_jobs:
    if job[1] not in job_list_df.values:
        series = pd.Series(job)
        # note: DataFrame.append was removed in pandas 2.0; pd.concat is the modern equivalent
        job_list_df_new = job_list_df_new.append(series, ignore_index=True)

# replace the stored results with the current search
job_list_df = pd.DataFrame(all_jobs)

if not job_list_df_new.empty:
    print("New jobs:", job_list_df_new)
    job_list_df_new.to_csv(csv_file_timestamped)
else:
    print("There are no new jobs for this search term")
job_list_df.to_csv(csv_file)
This did the job, at least for a week.
The next Monday I woke up to find that the script didn't work anymore -
Selenium was failing to submit the search. I tried tinkering with the waits
and with the way the script selects the elements, but to no avail; they seem
to have (maybe on purpose) made the website quite hard to automate with
Selenium.
Code is here.
I was quite frustrated after this.
Will my work be in vain?
Will I ever be able to search for job offers elegantly?
Requests with BeautifulSoup approach
A revelation came the next day. Since MechanicalSoup is built on both requests
and BeautifulSoup, why not combine them directly and use their full potential?
I looked at the POST request the search query made, experimented with the
parameters within it to figure out which ones are crucial for the search, and
narrowed it down to the following set:
data_test_zurich = {
    '__seo_search': 'search',
    '__search_freetext': keyword,
    '__search_city': location[0],
    'seal': random_search_id,
    '__search_city_location_id': location[1],
    '__search_city_country': location[2],
    '__search_city_perimeter': '100',
    'search_id': random_search_id,
    'search_simple': 'suchen',
}
The 'random_search_id' was necessary, and I suppose it somehow groups together
the pagination of the search results. Feeding this payload through a
requests.Session() gave me the desired behavior - consistent and reliable job
listings across multiple result pages.
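Put together, the core of this approach is only a few lines (a sketch; the endpoint URL is a hypothetical stand-in for the site's real search URL):

import requests
from bs4 import BeautifulSoup

session = requests.Session()  # keeps cookies between requests, so pagination stays consistent
response = session.post("https://example-jobs.com/search", data=data_test_zurich)

soup = BeautifulSoup(response.text, "html.parser")
for link in soup.find_all("a", href=True):
    print(link.get_text(strip=True), link["href"])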
I put it together with the useful functionality from the previous attempts
(show only new jobs, remember the last search results) and added some perks
like a nicer console view and a command line interface built with click. In
the end I basically came full circle - starting out exploring with
MechanicalSoup, switching to the old mate Selenium, and finally solving the
problem with the very libraries MechanicalSoup is built on.
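The click interface can be as small as this (a sketch; the option names are hypothetical, and the actual call into the scraper is elided):

import click

@click.command()
@click.option("--keyword", required=True, help="Search term to look for")
@click.option("--location", default="Zurich", help="City to search in")
def cli(keyword, location):
    """Run the job search and print only the new listings."""
    click.echo(f"Searching for '{keyword}' in {location} ...")
    # ... call into the scraper here

if __name__ == "__main__":
    cli()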