How I scraped information on 6,000 physics faculty members
When I decided to apply for physics graduate programs in the United States, I had to check out each program and its faculty members to make sure their research fields aligned with my interests. People usually search for graduate programs through services like Peterson's. As a graduate student who is also in touch with the experiences of my friends in graduate school, I'm pretty sure that the graduate experience is determined far more by the advisor than by the program or the university. The problem with services like Peterson's is that they focus on programs, not on faculty members. Hence, the value of having a database of faculty information becomes clear. But how can we get information on roughly all faculty members of every program in the United States?
One way to get faculty information is by scraping the data from each program webpage and the faculty personal webpages linked to it. To learn more about each faculty member's research area, one also needs information from Google Scholar. Since I was foreseeing a scalable project that could be applied to other majors, not only physics programs, I had to write a smart crawler that scrapes data regardless of how the program page is designed.
How does the smart crawler work?
To scrape a website, one normally has to inspect the page design first, find where the information lives, and hard-code those locations into the crawler. The problem with this approach is that the crawler has to be customized for every program website, which does not scale, and since I plan to expand this project to other majors, the work had to be automated. The smart crawler takes the program URL and maps the HTML source tree. It then looks for images on the page, because faculty members usually have a photo on the program page. The crawler finds the nodes containing an image and scrapes the information wrapped around the picture, which is usually the name, phone number, position, and link to the faculty member's personal website.
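A minimal sketch of this image-anchored pass is shown below, assuming a generic faculty listing page. The URL, the number of parent levels walked up, and the field names are assumptions for illustration and will vary between program websites.

```python
# Sketch of the image-anchored scraping strategy (hypothetical page layout).
import requests
from bs4 import BeautifulSoup

def scrape_faculty_by_images(program_url):
    html = requests.get(program_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    faculty = []
    for img in soup.find_all("img"):
        # Walk up a couple of levels: the card/row wrapping the photo usually
        # also contains the name, phone number, position, and personal link.
        card = img
        for _ in range(2):
            if card.parent is not None:
                card = card.parent

        text = " ".join(card.get_text(separator=" ").split())
        link = card.find("a", href=True)

        if text:  # keep anything non-empty; parsing/cleanup happens later
            faculty.append({
                "raw_text": text,
                "profile_url": link["href"] if link else None,
                "photo_url": img.get("src"),
            })
    return faculty

# Example with a hypothetical program URL:
# print(scrape_faculty_by_images("https://physics.example.edu/people"))
```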
If the faculty members don't have photos on the program website, the crawler falls back to looking for the node in the HTML tree with the maximum number of children. That node is usually the one holding the faculty listing, so the crawler goes through each child, which presumably contains one faculty member's information. This algorithm proved effective for more than 80% of the programs.
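A rough sketch of this fallback, under the assumption that "children" means direct child elements of a tag, might look like the following; it is an illustration of the heuristic, not the exact production code.

```python
# Fallback: treat each child of the element with the most direct children
# as one faculty entry.
from bs4 import BeautifulSoup, Tag

def scrape_faculty_by_max_children(html):
    soup = BeautifulSoup(html, "html.parser")

    # Find the tag with the largest number of direct child elements.
    best, best_count = None, 0
    for tag in soup.find_all(True):
        count = sum(1 for c in tag.children if isinstance(c, Tag))
        if count > best_count:
            best, best_count = tag, count

    entries = []
    for child in (best.children if best is not None else []):
        if isinstance(child, Tag):
            text = " ".join(child.get_text(separator=" ").split())
            if text:
                entries.append(text)
    return entries
```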
How to get each faculty member's research interests?
In addition to the basic information about each faculty member, one needs to know more about their research area. This information can be scraped from the faculty member's personal website, but it is often missing or out of date. A more reliable source is each faculty member's Google Scholar profile. There the crawler collects 15 paper titles and abstracts, the listed research interests, and the coauthors, and stores them. I used a Google Scholar API library and modified its source code to route requests through ScraperAPI instead of sending them directly to Google Scholar, which would get my requests blocked or, at worst, my IP address banned.
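The post describes patching the library's source to go through ScraperAPI; as a hedged sketch, recent versions of the `scholarly` package expose ScraperAPI support directly, so an equivalent setup might look like this. The API key, faculty name, and the choice of `scholarly` itself are assumptions for illustration.

```python
# Sketch: fetch a faculty member's Google Scholar data via ScraperAPI,
# assuming the `scholarly` package (not necessarily the library the author used).
from scholarly import scholarly, ProxyGenerator

def get_scholar_profile(name, scraperapi_key, n_papers=15):
    # Route all Scholar requests through ScraperAPI to avoid blocks/bans.
    pg = ProxyGenerator()
    pg.ScraperAPI(scraperapi_key)
    scholarly.use_proxy(pg)

    author = next(scholarly.search_author(name))
    author = scholarly.fill(author, sections=["basics", "coauthors", "publications"])

    papers = []
    for pub in author.get("publications", [])[:n_papers]:
        pub = scholarly.fill(pub)  # fetch title and abstract for each paper
        papers.append({
            "title": pub["bib"].get("title"),
            "abstract": pub["bib"].get("abstract"),
        })

    return {
        "name": author.get("name"),
        "interests": author.get("interests", []),
        "coauthors": [c.get("name") for c in author.get("coauthors", [])],
        "papers": papers,
    }

# Example with placeholder values:
# profile = get_scholar_profile("Jane Doe", "YOUR_SCRAPERAPI_KEY")
```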