Python: (Beautifulsoup) How to limit extracted text from a html news article to only the news article.


I wrote this test code which uses BeautifulSoup.

import urllib.request
from bs4 import BeautifulSoup

url = "http://www.dailymail.co.uk/news/article-3795511/Harry-Potter-sale-half-million-pound-house-Iconic-Privet-Drive-market-suburban-Berkshire-complete-cupboard-stairs-one-magical-boy.html"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "lxml")
# Print the text of every <p> element on the page
for n in soup.find_all('p'):
    print(n.get_text())

It works, but it also retrieves text that is not part of the news article, such as the time it was posted, the number of comments, copyright notices, etc.

I would like it to retrieve only the text of the news article itself. How would one go about this?
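One possible direction, as a minimal sketch rather than a confirmed solution: instead of calling find_all('p') on the whole page, first locate the element that wraps the article body and search for paragraphs only inside it. The sample HTML, the helper name article_paragraphs, and the itemprop="articleBody" attribute below are all assumptions made for illustration; the real page's markup would have to be inspected to find the right container.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a news page. The container and its
# itemprop value are assumptions, not taken from the real site.
SAMPLE_HTML = """
<html><body>
  <div class="byline"><p>Published: (timestamp)</p></div>
  <div itemprop="articleBody">
    <p>First paragraph of the story.</p>
    <p>Second paragraph of the story.</p>
  </div>
  <div class="footer"><p>Copyright notice</p></div>
</body></html>
"""


def article_paragraphs(html, container_attrs):
    """Return the text of each <p> inside the first element matching container_attrs."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(attrs=container_attrs)
    if container is None:
        # No matching container: return nothing rather than the whole page.
        return []
    return [p.get_text(strip=True) for p in container.find_all("p")]


paragraphs = article_paragraphs(SAMPLE_HTML, {"itemprop": "articleBody"})
for text in paragraphs:
    print(text)
```

With the sample above, only the two story paragraphs are printed; the byline and footer paragraphs are skipped because they sit outside the assumed container.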

