Date

Oct 2017

Category

Data Viz

The dataset comes from the Guardian's Datablog.

The 1,000 songs database had only five features (Theme, Title, Artist, Year, and Spotify URL), so I attempted to scrape each artist's Wikipedia infobox with Beautiful Soup. While I successfully scraped one page, I realized it would be cumbersome to loop through all the artists, particularly because their names are not standardized. So I fell back on the old-school 2000s method: searching the internet for each unique artist (there were 600+) and recording their genre, location, group/solo status, and gender. It took one weekend to complete, and I think it was well worth it, because the extra features make my analysis richer and more flexible.
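To illustrate why the naming problem makes a loop awkward, here is a minimal sketch of how an artist name could be turned into a candidate Wikipedia URL. The helper `wikipedia_url` is hypothetical (it is not part of my original code), and it assumes the article title matches the artist name exactly, which is the very assumption that breaks down in practice.

```python
from urllib.parse import quote

def wikipedia_url(artist):
    """Build a candidate Wikipedia URL from an artist name (hypothetical helper).

    Assumes the article title is the artist name with spaces replaced by
    underscores. Nothing here checks that the page exists or that it is
    about a musician.
    """
    title = artist.strip().replace(" ", "_")
    # quote() percent-encodes characters that are not URL-safe
    return "https://en.wikipedia.org/wiki/" + quote(title)

print(wikipedia_url("The Beatles"))
```

A name that is spelled differently in the dataset, or that is shared with a non-musical topic, may resolve to a different article or a disambiguation page, so every result would still need a manual check.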



The Beautiful Soup code for scraping a single infobox follows, along with its output for The Beatles.


					from bs4 import BeautifulSoup
					from urllib.request import urlopen
					url= "http://en.wikipedia.org/wiki/The_Beatles"
					
					page = urlopen(url)
					soup = BeautifulSoup(page.read(), "lxml")
					table = soup.find('table', class_='infobox vcard plainlist')
					result = {}
					exceptional_row_count = 0
					for tr in table.find_all('tr'):
						if tr.find('th'):
							result[tr.find('th').text] = tr.find('td').text if tr.find('td') else None
						else:
							exceptional_row_count += 1
					if exceptional_row_count > 1:
						print ('WARNING ExceptionalRow>1: ', table)
					print (result)
					
					{'The Beatles': None, 'Background information': None, 'Origin': 'Liverpool, England', 'Genres': '\n\n\nRock\npop\n\n\n', 'Years active': '1960–1970', 'Labels': '\n\n\nParlophone\nApple\nCapitol\n\n\n', 'Associated acts': '\n\n\nThe Quarrymen\nBilly Preston\nPlastic Ono Band\n\n\n', 'Website': 'thebeatles.com', '': None, 'Past members': '\n\nJohn Lennon\nPaul McCartney\nGeorge Harrison\nRingo Starr\n\nSee members section for others'}
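The scraped values still carry the infobox's line breaks, and header rows come through as `None`. A minimal sketch of how the dictionary could be tidied before analysis is shown below. The function `clean_infobox` and the `sample` dictionary are my own illustration (the sample is a subset of the output above), not part of the original scrape.

```python
def clean_infobox(raw):
    """Tidy a scraped infobox dictionary (hypothetical helper).

    Drops rows with an empty key or no value, splits multi-line values
    into lists, and keeps single-line values as plain strings.
    """
    cleaned = {}
    for key, value in raw.items():
        if not key or value is None:
            continue  # skip header rows and the empty-key row
        parts = [p.strip() for p in value.split("\n") if p.strip()]
        cleaned[key] = parts[0] if len(parts) == 1 else parts
    return cleaned

# A subset of the scraped output shown above
sample = {
    "The Beatles": None,
    "Origin": "Liverpool, England",
    "Genres": "\n\n\nRock\npop\n\n\n",
    "": None,
}
print(clean_infobox(sample))
# {'Origin': 'Liverpool, England', 'Genres': ['Rock', 'pop']}
```

Keeping multi-valued fields such as Genres as lists makes it easier to count or filter by individual genre later.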