Beautiful Soup is a Python package for parsing HTML and XML documents, including those with malformed markup. It creates a parse tree for documents that can be used to extract data from HTML,[3] which is useful for web scraping.[2][4]
Original author(s) | Leonard Richardson |
---|---|
Initial release | 2004 |
Stable release | 4.12.3[1] / 17 January 2024 |
Repository | |
Written in | Python |
Platform | Python |
Type | HTML parser library, Web scraping |
License | |
Website | www |
Beautiful Soup was started by Leonard Richardson, who continues to contribute to the project,[5] and is additionally supported by Tidelift, a paid subscription to open-source maintenance.[6]
Beautiful Soup represents parsed data as a tree which can be searched and iterated over with ordinary Python loops.[7] The example below uses the Python standard library's urllib[8] to load Wikipedia's main page, then uses Beautiful Soup to parse the document and search for all links within.
#!/usr/bin/env python3
# Anchor extraction from HTML document
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Fetch the page and parse it while the response is still open
with urlopen('https://en.wikipedia.org/wiki/Main_Page') as response:
    soup = BeautifulSoup(response, 'html.parser')
    # Print the href of every <a> tag, falling back to '/' when it is missing
    for anchor in soup.find_all('a'):
        print(anchor.get('href', '/'))
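The same search can be tried without a network request by parsing a string held in memory. The following is a minimal sketch, not part of the original example: the sample markup, the `extract_links` helper and the `example.org` URL are hypothetical, and the unclosed `<p>` and `<li>` tags are there only to illustrate that malformed markup still yields a searchable tree.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the unclosed <p> and <li> tags are deliberate.
HTML = """
<html><body>
<p class="intro">See the <a href="/wiki/Python">Python</a> page
<ul>
  <li><a href="https://example.org/docs">Docs</a>
  <li><a name="no-href">Anchor without href</a>
</ul>
</body></html>
"""


def extract_links(markup):
    """Return (text, href) pairs for every <a> tag; href is None if absent."""
    soup = BeautifulSoup(markup, "html.parser")
    return [(a.get_text(strip=True), a.get("href")) for a in soup.find_all("a")]


links = extract_links(HTML)
for text, href in links:
    print(f"{text!r} -> {href}")
```

Here `get("href")` is called without a default, so anchors that lack the attribute are reported as `None` rather than as `'/'`.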
Beautiful Soup is named after both a poem in Alice's Adventures in Wonderland[9] and tag soup.[10]
Beautiful Soup 3 was the official release line of Beautiful Soup from May 2006 to March 2012. The current release is Beautiful Soup 4.x. Beautiful Soup 4 can be installed with pip install beautifulsoup4.
Python 2.7 support was retired in 2021; release 4.9.3 was the last to support it.[11]
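Since the package is installed as beautifulsoup4 but imported as bs4, a quick check of the installed release can be useful. The snippet below is a minimal sketch that assumes Python 3.8 or later (for `importlib.metadata`) and an environment where the package is already installed; the variable names are illustrative.

```python
from importlib.metadata import version  # standard library, Python 3.8+

import bs4  # the distribution "beautifulsoup4" is imported as "bs4"

# Look up the installed release by its distribution name.
installed = version("beautifulsoup4")
major = int(installed.split(".")[0])

print("Beautiful Soup", installed)
print("Release line:", f"{major}.x")
```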
Beautiful Soup is licensed under the same terms as Python itself.