I wrote this test code, which uses BeautifulSoup:
    import urllib.request
    from bs4 import BeautifulSoup

    url = "http://www.dailymail.co.uk/news/article-3795511/Harry-Potter-sale-half-million-pound-house-Iconic-Privet-Drive-market-suburban-Berkshire-complete-cupboard-stairs-one-magical-boy.html"
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "lxml")
    # Print the text of every <p> tag on the page
    for n in soup.find_all('p'):
        print(n.get_text())
It works, but it also retrieves text that is not part of the news article, such as the time it was posted, the number of comments, copyright notices, etc.
I would like it to retrieve only the text of the news article itself. How would one go about this?
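One direction that might work, as a rough sketch rather than a confirmed solution: find the single element that wraps the article body and collect only the `<p>` tags inside it, instead of every `<p>` on the page. The helper name `extract_article_text`, the made-up `sample_html`, and the selector `div[itemprop="articleBody"]` are all assumptions for illustration; the real page's markup would need to be inspected in the browser to find the right container. The sketch uses the built-in `"html.parser"` so it runs without extra dependencies, but `"lxml"` could be passed instead.

```python
from bs4 import BeautifulSoup


def extract_article_text(html, container_selector, parser="html.parser"):
    """Return the text of the <p> tags inside the first matching container.

    container_selector is a CSS selector for the element assumed to wrap
    the article body (hypothetical; check the real page's markup).
    """
    soup = BeautifulSoup(html, parser)
    container = soup.select_one(container_selector)
    if container is None:
        # Selector did not match anything on this page
        return []
    return [p.get_text(strip=True) for p in container.find_all("p")]


# Made-up HTML standing in for a downloaded page
sample_html = """
<html><body>
  <p class="byline">Published at some time</p>
  <div itemprop="articleBody">
    <p>First paragraph of the story.</p>
    <p>Second paragraph of the story.</p>
  </div>
  <p class="footer">Copyright notice</p>
</body></html>
"""

paragraphs = extract_article_text(sample_html, 'div[itemprop="articleBody"]')
for text in paragraphs:
    print(text)
```

With the real page, the `html` returned by `urllib.request.urlopen(url).read()` would be passed in place of `sample_html`, along with whatever selector matches the article container there.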