Web scraping with Python
Today I am going to write about web scraping. First, let’s look at a definition of web scraping.
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. The web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
-Wikipedia
Suppose you are building a large dataset that needs many text samples. Web scraping is one way to collect them. Or suppose you are building a price comparison website or a job listing website. You might want to collect email addresses from different websites, or gather many file download links automatically.
In simple words, we are going to extract or collect data from websites using computer programs.
Most importantly, since you will be extracting data from various websites, some of that data may be proprietary. So make sure you only extract data from websites that permit it. Otherwise, you might be breaking the law!
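One signal of what a site allows is its robots.txt file. The sketch below is only an illustration, not legal advice: robots.txt is a convention for crawlers, and a site's terms of use still apply. It uses Python's standard urllib.robotparser module. The robots.txt content, the "my-scraper" user agent, and the paths are made up for the example, and is_allowed is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "my-scraper" and both paths are assumptions made for this example.
print(is_allowed(ROBOTS_TXT, "my-scraper", "https://samplesite.com/articles/"))
print(is_allowed(ROBOTS_TXT, "my-scraper", "https://samplesite.com/private/data"))
```

For a real site you would normally point the parser at the live file with set_url(...) and read() instead of inlining the text.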
Now that you know what web scraping is, why to use it, and how to use it legally, it is time to focus on our main topic: web scraping using Python.
First, let’s look at some of the libraries available in Python for web scraping:
- Scrapy
- Requests
- urllib
- Beautiful Soup
- Selenium, etc.
In this tutorial, I will focus on Beautiful Soup. I might talk about the other libraries in later articles.
Beautiful Soup
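Beautiful Soup is a Python package for extracting data from XML and HTML documents. It works on top of a parser, such as Python’s built-in html.parser, and turns the document into a tree that you can search and navigate.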
How to install it?
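First, make sure you have Python and pip installed. Then type the following command in the terminal. The examples below also use the Requests library to download pages, so it is installed here as well.
pip install beautifulsoup4 requests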
That’s it! We can start using it now.
How to use it?
- First, get the HTML/XML page
- Then parse it with Beautiful Soup
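To make the second step concrete before touching the network, here is a minimal sketch that parses an HTML string directly. The HTML content is invented for the example and simply stands in for the text of a downloaded page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, standing in for the text of a downloaded page.
SAMPLE_HTML = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Job listings</h1>
    <p class="job">Python developer</p>
    <p class="job">Data analyst</p>
  </body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

page_title = soup.title.string        # text inside the <title> tag
heading = soup.find("h1").get_text()  # text of the first <h1> tag
# Text of every <p> tag whose class is "job"
jobs = [p.get_text() for p in soup.find_all("p", class_="job")]

print(page_title)
print(heading)
print(jobs)
```

Working from a string like this is a handy way to try out selectors without sending repeated requests to a website.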
Let’s do some quick coding. First, import these:
from bs4 import BeautifulSoup
import requests
Get contents from a page
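# Download the page (Requests needs a full URL, including a scheme such as https://)
pageget = requests.get("https://samplesite.com/….")
# Parse the downloaded HTML with Python's built-in parser
bsoup = BeautifulSoup(pageget.text, 'html.parser')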
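The request above assumes the download succeeds. A slightly more defensive sketch might add a timeout and check the HTTP status before parsing. The helper name fetch_soup, the 10 second default, and the "/page" path are assumptions made for illustration. The demonstration at the end deliberately passes a URL without a scheme, which Requests rejects before it connects to anything.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url: str, timeout: float = 10.0) -> BeautifulSoup:
    """Download url and return the parsed page (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)  # give up if the server is too slow
    response.raise_for_status()  # raise requests.HTTPError on 4xx/5xx responses
    return BeautifulSoup(response.text, "html.parser")


try:
    # No "https://" prefix, so Requests raises MissingSchema without any network call.
    fetch_soup("samplesite.com/page")
except requests.exceptions.MissingSchema as error:
    print("Bad URL:", error)
```

With a full URL, fetch_soup returns a BeautifulSoup object that can be used the same way as bsoup.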
Now let’s collect all the href links and put them in a list:
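# Find every <a> tag on the page
links = bsoup.find_all('a')
# Avoid naming the variable "list", which would shadow Python's built-in type
hrefs = []
for x in links:
    # .get returns None when a tag has no href attribute
    hrefs.append(x.get('href'))
print(hrefs)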
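The loop above keeps whatever each tag holds, so the result can contain None (for tags without an href) and relative paths. One possible refinement, sketched here on made-up HTML, skips tags without an href and resolves relative links against the page address with the standard urllib.parse.urljoin. BASE_URL and the sample links are assumptions for the example.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://samplesite.com/jobs/"  # hypothetical address of the scraped page

# Hypothetical HTML with a mix of relative, absolute, and missing hrefs.
SAMPLE_HTML = """
<a href="/about">About</a>
<a href="listing.html">Listing</a>
<a href="https://other.example/file.zip">Download</a>
<a name="top">Anchor without href</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

absolute_links = [
    urljoin(BASE_URL, a["href"])            # turn relative paths into full URLs
    for a in soup.find_all("a", href=True)  # href=True skips <a> tags with no href
]

print(absolute_links)
```

Links that are already absolute pass through urljoin unchanged, so the same line handles both cases.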
This is just one example of what Beautiful Soup can do, and there is a lot more to explore. Hopefully, I will cover more of it in upcoming articles.
Until then, bye!