[VB.NET] Parsing text between HTML tags

Discussion in 'Mixed Languages' started by QuantumBug, Oct 14, 2012.

  1. QuantumBug

    QuantumBug MDL Developer

    Mar 7, 2012
    1,488
    1,322
    60
    #1 QuantumBug, Oct 14, 2012
    Last edited by a moderator: Apr 20, 2017
    Basically, I'm using a StreamReader to load the HTML of a site into a buffer to read from.

    I'm reading the HTML and finding what I want fine using .IndexOf(""), but I'm not sure how to stop reading where the tag ends.

    Because of this I'm using a fixed string length, so if I'm grabbing a release date I get 7/3/11</li> plus more markup I don't want after the closing </li> tag.

    To put what I want in simple terms, here is an example of what I'm reading.

    Code:
    <li class="ReleaseDate"><label>Release date:</label> 12/16/2010</li>
    
    How could I read JUST the date from that line of code using .IndexOf? Can I tell it to end reading the string at </li>? I'm really stuck with this :(
     
  2. stevemk14ebr

    stevemk14ebr MDL Senior Member

    Jun 23, 2010
    267
    48
    10
    #2 stevemk14ebr, Oct 14, 2012
    Last edited by a moderator: Apr 20, 2017
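    Try this:
    Code:
        Dim str As String
        Dim strArr() As String
        Dim box3 As String

        Private Sub Form1_Load(sender As Object, e As EventArgs) Handles MyBase.Load
            ' XML literal (needs System.Xml.Linq); .Value returns only the text content
            str = <li class="ReleaseDate"><label>Release date:</label> 12/16/2010</li>.Value
            ' Split on spaces: "Release", "date:", "12/16/2010"
            strArr = str.Split(" "c)
            box3 = strArr(2)
            TextBox3.Text = box3
        End Sub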
    
    This works because the markup is read as an XML element, so str is actually equal to "Release date: 12/16/2010" (just the text, without the tags). When I split it with the .Split method, three things are put into the string array (strArr): "Release", "date:", and the date itself. Since arrays start at 0, the third item is strArr(2).
    If you put
    Code:
        MessageBox.Show(str)
    you can see what str is actually equal to.
     
  3. QuantumBug

    QuantumBug MDL Developer

    Mar 7, 2012
    1,488
    1,322
    60
    #3 QuantumBug, Oct 14, 2012
    Last edited by a moderator: Apr 20, 2017
    (OP)
    Couldn't get it to work.

    Literally the whole webpage is stored so it can be read from. I can get the release date fine, e.g. 12/16/2010</li>, but I want to remove everything after the date (the </li> onwards).

    Well, I can get the release date on its own OK, but each Xbox 360 game on the site has some information about the game whose character count always changes, while the tag at the end is always the same, so cutting off anything past </li> is essential.

    So what I'd do is read from the index needed to get the data and remove any data after a matching string. I'll post my code in an edit here.

    EDIT:

    Code:
        
    Private Sub Button2_Click(sender As System.Object, e As System.EventArgs) Handles Button2.Click
        Dim srchString As String = "<label>Release date:</label>" 'Storing the string I'm looking for

        Dim net As New Net.WebClient() 'Add webclient

        Dim src As String = net.DownloadString("http://marketplace.xbox.com/en-US/Product/Fable-III/66acd000-77fe-1000-9115-d8024d5308d6") 'Downloading the HTML

        Dim rd As String

        rd = src.Substring(src.IndexOf(srchString) + srchString.Length, 50) 'Grabbing the data I want

        'This above code returns > "6/8/2011</li> <li class="FileSize"><labe" - Now, how can I trim EVERYTHING past </li>????

        TextBox2.Text = rd
    End Sub
    
     
  4. stevemk14ebr

    stevemk14ebr MDL Senior Member

    Jun 23, 2010
    267
    48
    10
    #4 stevemk14ebr, Oct 14, 2012
    Last edited by a moderator: Apr 20, 2017
    Here you go, I tested it:
    Code:
        Dim loc As Integer
        Dim str As String

        Private Sub Form1_Load(sender As Object, e As EventArgs) Handles MyBase.Load

            Dim srchString As String = "<label>Release date:</label>" 'Storing the string I'm looking for

            Dim net As New Net.WebClient() 'Add webclient

            Dim src As String = net.DownloadString("http://marketplace.xbox.com/en-US/Product/Fable-III/66acd000-77fe-1000-9115-d8024d5308d6") 'Downloading the HTML

            Dim rd As String

            rd = src.Substring(src.IndexOf(srchString) + srchString.Length, 50) 'Grabbing the data I want

            'The above returns "6/8/2011</li> <li class="FileSize"><labe", so find the start of </li>
            loc = rd.IndexOf("</li>")
            'IndexOf returns the position of the first character of </li> (or -1 if it is missing),
            'so keep only what comes before it and trim the surrounding spaces
            str = If(loc >= 0, rd.Remove(loc), rd).Trim()
            TextBox2.Text = str
        End Sub
    By the way, how do you keep the color in the code tags when you paste on here?
     
  5. QuantumBug

    QuantumBug MDL Developer

    Mar 7, 2012
    1,488
    1,322
    60
    Copy and paste straight from Visual Studio, but don't use IE, use Firefox :)

    Thanks for the code, will implement it after dinner and write back.
     
  6. Calistoga

    Calistoga MDL Senior Member

    Jul 25, 2009
    420
    198
    10
    #6 Calistoga, Oct 14, 2012
    Last edited by a moderator: Apr 20, 2017
    Have you considered using regular expressions?

    The following expression matches dates such as 6/8/2011 or 12/16/2010 in the HTML source (one or two digits for month and day, two to four for the year):
    Code:
    \d{1,2}/\d{1,2}/\d{2,4}
    Test it with [C#]:
    Code:
    StringCollection resultList = new StringCollection();

    try
    {
        Regex regexObj = new Regex(@"\d{1,2}/\d{1,2}/\d{2,4}");
        Match matchResult = regexObj.Match(subjectString);

        // Collect every date found in subjectString
        while (matchResult.Success)
        {
            resultList.Add(matchResult.Value);
            matchResult = matchResult.NextMatch();
        }
    }
    catch (ArgumentException)
    {
        // Syntax error in the regular expression
    }

    [VB.NET]:
    Code:
    Dim ResultList As StringCollection = New StringCollection()

    Try
        Dim RegexObj As New Regex("\d{1,2}/\d{1,2}/\d{2,4}")
        Dim MatchResult As Match = RegexObj.Match(SubjectString)

        'Collect every date found in SubjectString
        While MatchResult.Success
            ResultList.Add(MatchResult.Value)
            MatchResult = MatchResult.NextMatch()
        End While
    Catch ex As ArgumentException
        'Syntax error in the regular expression
    End Try

    You can create more complex expressions depending on how you want to solve the problem. If you know that the string you're checking contains one, and only one, date, then the following will work fine to retrieve it [C#]:
    Code:
    string resultString = null;

    try
    {
        // Value is an empty string when nothing matches
        resultString = Regex.Match(subjectString, @"\d{1,2}/\d{1,2}/\d{2,4}").Value;
    }
    catch (ArgumentException)
    {
        // Syntax error in the regular expression
    }

    [VB.NET]:
    Code:
    Dim ResultString As String = Nothing

    Try
        'Value is an empty string when nothing matches
        ResultString = Regex.Match(SubjectString, "\d{1,2}/\d{1,2}/\d{2,4}").Value
    Catch ex As ArgumentException
        'Syntax error in the regular expression
    End Try
    
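    And don't forget [C#]:
    Code:
    using System.Collections.Specialized; // for StringCollection
    using System.Text.RegularExpressions;

    [VB.NET]:
    Code:
    Imports System.Collections.Specialized ' for StringCollection
    Imports System.Text.RegularExpressions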
    
    You can skip the try/catch-blocks if you want, but they might be useful if you change the expressions often.
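
    Since the label text in front of the date is fixed, a capture group can also pull out whatever sits between that label and the closing tag, whatever its length. The following is only a sketch under that assumption: the helper name ExtractReleaseDate and the sample string are made up for illustration and are not taken from the real page.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ReleaseDateRegexSketch
    ' Hypothetical helper: returns the text between the "Release date:" label
    ' and the next </li>, or Nothing when the pattern is not found.
    Function ExtractReleaseDate(html As String) As String
        ' (.*?) is a lazy capture, so it stops at the first </li> after the label.
        Dim pattern As String = "<label>Release date:</label>\s*(.*?)\s*</li>"
        Dim options As RegexOptions = RegexOptions.IgnoreCase Or RegexOptions.Singleline
        Dim m As Match = Regex.Match(html, pattern, options)
        If Not m.Success Then Return Nothing
        Return m.Groups(1).Value
    End Function

    Sub Main()
        ' Made-up sample in the same shape as the markup quoted in this thread.
        Dim sample As String =
            "<li class=""ReleaseDate""><label>Release date:</label> 6/8/2011</li> " &
            "<li class=""Other"">more text</li>"
        Console.WriteLine(ExtractReleaseDate(sample)) ' prints 6/8/2011
    End Sub
End Module
```

    The same pattern shape should carry over to the other fields by swapping the label text, as long as each value ends at the next </li>.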
     
  7. QuantumBug

    QuantumBug MDL Developer

    Mar 7, 2012
    1,488
    1,322
    60
    #7 QuantumBug, Oct 15, 2012
    Last edited: Oct 15, 2012
    (OP)
    @stevemk14ebr

    It works, but the length of the data still differs from game to game, so reading a fixed number of characters still doesn't cut everything off at the right point. It can range from a date... 2/3/10 = 9 characters total, to a whole essay of information on the game...

    @Calistoga

    Thanks, but literally, for every game all the data is different; this includes the release date, rating, PEGI, information, title, titleid, features, etc.

    I think I'm overthinking this; there has to be a simpler way, I'm just drowning in code, haha. I think it's stupid you can't just read between the tags... All hail XML. It's a shame, as I've come really far with it and now I'm stuck at a brick wall.
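
    Reading between two tags is possible with IndexOf alone, provided the search for the closing tag starts where the opening marker ends; the length of the value then no longer matters. A minimal sketch, assuming the markers appear literally in the page source (ExtractBetween and the sample string are hypothetical names and data, not part of the code posted above):

```vbnet
Imports System

Module BetweenMarkersSketch
    ' Hypothetical helper: returns the text between startMarker and the next
    ' endMarker after it, or Nothing if either marker is missing.
    Function ExtractBetween(source As String, startMarker As String, endMarker As String) As String
        Dim startIndex As Integer = source.IndexOf(startMarker, StringComparison.Ordinal)
        If startIndex < 0 Then Return Nothing
        startIndex += startMarker.Length

        ' Begin the second search where the first match ended.
        Dim endIndex As Integer = source.IndexOf(endMarker, startIndex, StringComparison.Ordinal)
        If endIndex < 0 Then Return Nothing

        ' The length is computed from the two positions instead of being fixed.
        Return source.Substring(startIndex, endIndex - startIndex).Trim()
    End Function

    Sub Main()
        ' Made-up sample in the same shape as the markup quoted in this thread.
        Dim sample As String =
            "<li class=""ReleaseDate""><label>Release date:</label> 12/16/2010</li><li>more</li>"
        Console.WriteLine(ExtractBetween(sample, "<label>Release date:</label>", "</li>")) ' prints 12/16/2010
    End Sub
End Module
```

    Returning Nothing when a marker is missing avoids the exception that Substring would throw if IndexOf came back as -1.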
     
  8. Calistoga

    Calistoga MDL Senior Member

    Jul 25, 2009
    420
    198
    10
    I hear ya. However, as long as the data is formatted consistently everywhere, a regular expression should work.

    Alternatively, you could try the HtmlAgilityPack.
    Let us know how you did it when it's done!
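
    For the HtmlAgilityPack route, a rough sketch could look like the following. It assumes the HtmlAgilityPack library is referenced in the project and that the date is the last child node of the li element with class "ReleaseDate", as in the snippet from the first post; the helper name and the sample string are hypothetical.

```vbnet
Imports System
Imports HtmlAgilityPack

Module AgilityPackSketch
    ' Hypothetical helper: parses the HTML and returns the text that follows
    ' the <label> inside <li class="ReleaseDate">, or Nothing if it is absent.
    Function ExtractReleaseDate(html As String) As String
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' XPath lookup; SelectSingleNode returns Nothing when no node matches.
        Dim li As HtmlNode = doc.DocumentNode.SelectSingleNode("//li[@class='ReleaseDate']")
        If li Is Nothing Then Return Nothing

        ' Assumption: the date is the text node after the <label>, i.e. the last child.
        Dim dateNode As HtmlNode = li.LastChild
        If dateNode Is Nothing Then Return Nothing

        Return HtmlEntity.DeEntitize(dateNode.InnerText).Trim()
    End Function

    Sub Main()
        ' Made-up sample in the same shape as the markup quoted in this thread.
        Dim sample As String =
            "<ul><li class=""ReleaseDate""><label>Release date:</label> 12/16/2010</li></ul>"
        Console.WriteLine(ExtractReleaseDate(sample))
    End Sub
End Module
```

    Because the parser works on the element tree rather than on character positions, the value can be any length and the closing tag never ends up in the result.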
     
  9. Alphawaves

    Alphawaves Super Moderator/Developer
    Staff Member

    Aug 11, 2008
    5,873
    20,134
    180
    #9 Alphawaves, Oct 15, 2012
    Last edited by a moderator: Apr 20, 2017
    I'm not on a PC with C# or VB right now, but I'd think you could add this after your line (rd = src.Substring(src.IndexOf(srchString) + srchString.Length, 50)):
    Code:
    while (rd.StartsWith(" "))
    {
        rd = rd.Substring(1);
    }
    while (rd.Contains("<"))
    {
        rd = rd.Substring(0, rd.Length - 1);
    }
    I have some other methods of getting strings from URLs; I'll add more later, I'm messing around in Windows 8 at the moment. :eek::D
     
  10. QuantumBug

    QuantumBug MDL Developer

    Mar 7, 2012
    1,488
    1,322
    60
    Thanks for the code, I'm just messing about with the application right now. And yeah that would be awesome. :)
     