The Money Mustache Community
Learning, Sharing, and Teaching => Investor Alley => Topic started by: dungoofed on November 10, 2015, 04:13:04 PM
-
Hi - was wondering if anyone can help me scrape some data into Google Sheets.
I've just started getting into this whole "scraping" thing (yeah, only five years late). For listed stocks I can use GOOGLEFINANCE= and the Sheet populates nicely, but for things like warrants and bonds which aren't available I need another way to get the data.
An example would be from the following page:
http://www.asx.com.au/asx/markets/interestRateSecurityPrices.do?type=GOVERNMENT_BOND
If you go down to GSBK16, the value in the Valuation Price column (currently around 103.675) is the one I want to scrape. The way I've been trying to do it is using the formula:
=IMPORTXML("http://www.asx.com.au/asx/markets/interestRateSecurityPrices.do?type=GOVERNMENT_BOND","//*[@id='content']/table[1]/tbody/tr[2]/td[13]")
Google Sheets thinks for a moment then gives me a "#N/A Error - Imported Contents is Empty" error.
The way I got the Xpath was to open the page in Opera, select and inspect the element, then right click and select "Copy Xpath". I also changed the double-quotes around "content" to single quotes, and put the entire Xpath string in double-quotes.
Pulling my hair out. Anyone with more experience able to see why this isn't working?
-
I don't have the googles or a public spreadsheet to try this on, but you can try changing the wrapper quotes to single, and use double where they belong.
=IMPORTXML("http://www.asx.com.au/asx/markets/interestRateSecurityPrices.do?type=GOVERNMENT_BOND",'//*[@id="content"]/table[1]/tbody/tr[2]/td[13]')
And you might also try escaping the double quotes, https://www.google.com/webhp?hl=en#hl=en&q=xpath+escaping+double+quotes
Reading the docs... https://support.google.com/docs/answer/3093342?hl=en
Looks like the quotes might help. Also, are you getting the newest version of sheets in AU?
A maximum of 50 IMPORTXML calls are supported on a single spreadsheet. This limit is removed in the new version of Google Sheets.
-
Thanks, checking the above....
-
Hm can't seem to get anything to work. Have tried:
\"
CHAR(34)
plus all combinations of single, double and no quotes.
Regarding AU version, this is the only place I'm using IMPORTXML in the sheet at the moment.
-
I'm limited without a google sheet to try with... but I used to love building scraping apps with xpath.
try the importhtml() function.
https://productforums.google.com/forum/#!topic/docs/I1jR-b8wg48
-
Thanks, will check that too.
I tried to scrape the same data from a completely different page but same result. I'm thinking something might be wrong with the Xpath that Opera creates, so I might try and fire up Chrome and see if that makes a difference.
-
Thanks, will check that too.
I tried to scrape the same data from a completely different page but same result. I'm thinking something might be wrong with the Xpath that Opera creates, so I might try and fire up Chrome and see if that makes a difference.
This is totally possible, I was using Chrome to validate your query. However, the reliability of xpath queries depends on the quality of the HTML on the page. I used to have to scrub and clean (query further out, and string/find replace/remove) some badly written pages to get the data I wanted.
-
Hmm... it might be =ImportHTML that I need? Checking this too...
-
Hmm... it might be =ImportHTML that I need? Checking this too...
I'd try that... cause if the HTML is bad, xpath could be inaccurate, where importhtml() might read it differently.
-
Hm.. ImportHTML doesn't even manage to return the entire table. I'm thinking there is something incompatible between the page and Google Sheets
-
Using either function see if you can import any cell in the example table. Assuming W3 would have perfect HTML and the table is small.
http://www.w3schools.com/html/html_tables.asp
-
I think it's an issue with the page. Firefox, Opera and Chrome give me significantly different Xpaths
-
Bummer.
I don't know how determined you are...
http://webscraper.io
http://scrapy.org
-
Thanks, will have another look in a bit.
-
Still haven't solved it. I might have found a clue though.
As per the screenshot I can get some data with =IMPORTXML(A9,"//html")
(where A9 has my URL). The thing that is odd is that it appears three columns over. Not sure if this is a clue or not.
edit: the data I want is actually in that F-column cell further down, but I couldn't show it and take a screenshot at the same time.
-
Is it possible that Google doesn't want to take the parameter directly and wants it from a cell?
Try putting the path in a cell where A9 is the URL and A10 is the xpath.
=importxml(A9,A10)
-
Actually most of the time I have been doing it that was : (
-
If you're familiar with programming (or want to learn) you could use something like PHP to do an http request to the page you want to scrape and manually parse out the data. I've done this several times in the past.
-
Willing to give that a go too, but I'd like to use the data in Google Sheets if possible. Let me check if there are any PHP add-ons for Google Sheets.
-
For what you want to extract from the web page you provided. I think Octoparse http://www.octoparse.com/ (http://www.octoparse.com/) may interest you.
It's easy to use. just clicking and pointing the data you want from the websites, and Octoparse will get the data for you automatically. It won't let you down.
Download it now: http://www.octoparse.com/download
-
If you don't want to download software or write code to scrape websites you can use online GrabzIt's Web Scraper (https://grabz.it/scraper/) to scrape the information you need. If you export the data as XLSX or CSV you will be able to open it in Google Sheets.
-
@ Domus - The post before yours is from 2016, so this might not be relevant anymore.
-
I'm having the same trouble. Doe anyone know how =importxml treats a password-protected page? I'm logged in to the site, but it's giving me the same error OP got.
-
I'm having the same trouble. Doe anyone know how =importxml treats a password-protected page? I'm logged in to the site, but it's giving me the same error OP got.
Google will send a request from their server to the URL you give, and they don't have your password. You can only get publicly-accessible pages using Google Sheets.
-
Kinda late to the party (more than 120 days late..)
just came across this post and thought id share,
Parsehub has a tutorial on how to scrape data into google sheets
https://www.parsehub.com/blog/scrape-web-content-into-google-sheets/
just gotta put in the formula =importadata() in a1 cell once you're done