Charset Problem
I'm having a character set problem with the titles in my Top100.groovy.
When I save the HTML page returned by YouTube for "http://www.youtube.com/playlist?list=MCUS" from my browser, I get the line
<span class="title video-title " dir="ltr">Fun.: We Are Young ft. Janelle Monáe [OFFICIAL VIDEO]</span>
where "Monáe" has the hex bytes 4D 6F 6E E1 65.
When I use html = new URL("http://www.youtube.com/playlist?list=MCUS").getText() in my Groovy script and save the text, I get the line
<span class="title video-title " dir="ltr">Fun.: We Are Young ft. Janelle Monáe [OFFICIAL VIDEO]</span>
where "Monáe" has the hex bytes 4D 6F 6E C3 A1 65.
My research says I need to specify a code page that correctly recognizes E1.
So I looked at http://www.fileformat.info/info/unicode ... upport.htm
and saw that ISO-8859-1 contains the correct E1,
so I changed the Groovy script to use html = new URL("http://www.youtube.com/playlist?list=MCUS").getText("ISO-8859-1"),
but I still get
<span class="title video-title " dir="ltr">Fun.: We Are Young ft. Janelle Monáe [OFFICIAL VIDEO]</span>
where "Monáe" still has the hex bytes 4D 6F 6E C3 A1 65.
Other charsets do the same. How do I get the correct characters returned in my Groovy script?
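
For reference, the two byte sequences above are the same character, á (U+00E1), in two different encodings: E1 is its ISO-8859-1 form and C3 A1 is its UTF-8 form. The sketch below only illustrates that relationship. The toHex and fetchAndSave closures and the file path are hypothetical names introduced here, and the idea that the server sends UTF-8 while the browser re-encodes on save is an assumption drawn from the bytes, not something confirmed.

```groovy
// The title fragment, with U+00E1 (a with acute) written as an escape
// so the example does not depend on this script file's own encoding.
String name = "Mon\u00E1e"

// Hypothetical helper: render bytes as space-separated uppercase hex.
def toHex = { byte[] bytes ->
    bytes.collect { String.format("%02X", it & 0xFF) }.join(" ")
}

String utf8Hex   = toHex(name.getBytes("UTF-8"))
String latin1Hex = toHex(name.getBytes("ISO-8859-1"))

println "UTF-8:      ${utf8Hex}"     // 4D 6F 6E C3 A1 65
println "ISO-8859-1: ${latin1Hex}"   // 4D 6F 6E E1 65

// Hypothetical helper (defined but not called here): decode the response
// as UTF-8 (an assumption about the server), then choose the encoding
// explicitly when writing the file.
def fetchAndSave = { String address, String path ->
    String html = new URL(address).getText("UTF-8")
    new File(path).write(html, "ISO-8859-1")
}
```

If that assumption holds, something like fetchAndSave("http://www.youtube.com/playlist?list=MCUS", "playlist.html") would be expected to produce the single byte E1 in the saved file. Characters that do not exist in ISO-8859-1 would be replaced on write, so keeping the file in UTF-8 may be the safer choice.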