[VB.NET] Parsing text between HTML tags

Discussion in 'Mixed Languages' started by Muerto, Oct 14, 2012.

  1. Muerto

    Muerto MDL Debugger

    Mar 7, 2012
    1,858
    2,109
    60
    #1 Muerto, Oct 14, 2012
    Last edited: Jan 12, 2021
    ...
     
  2. stevemk14ebr

    stevemk14ebr MDL Senior Member

    Jun 23, 2010
    267
    48
    10
    #2 stevemk14ebr, Oct 14, 2012
    Last edited by a moderator: Apr 20, 2017
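    Try this:
    Code:
        Dim str As String
        Dim strArr() As String
        Dim box3 As String
    
        Private Sub Form1_Load(sender As Object, e As EventArgs) Handles MyBase.Load
            ' XML literal; assigning it to a String relies on Option Strict Off
            ' and yields the element's text content without the tags
            str = <li class="ReleaseDate"><label>Release date:</label> 12/16/2010</li>
            strArr = str.Split(" "c)
            box3 = strArr(2) ' third element: the date
            TextBox3.Text = box3
        End Sub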
    
    This works because the variable str actually ends up equal to "Release date: 12/16/2010": the XML literal is converted to its text content, so the tags are dropped. When I split it with the .Split method, three things are put into the string array (strArr): "Release", "date:", and the date itself. Since arrays start at 0, the third one is strArr(2).
    If you put
    Code:
     MessageBox.Show(str)
    you can see what str is actually equal to.
     
  3. Muerto

    #3 Muerto, Oct 14, 2012
    Last edited: Jan 12, 2021
    (OP)
    ...
     
  4. stevemk14ebr

    #4 stevemk14ebr, Oct 14, 2012
    Last edited by a moderator: Apr 20, 2017
    Here you go, I tested it:
    Code:
     Dim loc As Integer
        Dim str As String
    
        Private Sub Form1_Load(sender As Object, e As EventArgs) Handles MyBase.Load
    
            Dim srchString As String = "<label>Release date:</label>" 'Storing the string I'm looking for
    
            Dim net As New Net.WebClient() 'Add webclient
    
            Dim src As String = net.DownloadString("http://marketplace.xbox.com/en-US/Product/Fable-III/66acd000-77fe-1000-9115-d8024d5308d6") 'Downloading the HTML
    
            Dim rd As String
    
            rd = src.Substring(src.IndexOf(srchString) + srchString.Length, 50) 'Grabbing the data I want
    
            'This above code returns > "6/8/2011</li> <li class="FileSize"><labe" - Now, how can I trim EVERYTHING past </li>????
            'Find where </li> starts
            loc = rd.IndexOf("</li>")
            'Remove everything past </li>. IndexOf returns the position of the first character,
            'so +5 keeps the </li> itself. The trimmed string goes into str.
            str = rd.Remove((loc + 5))
            TextBox2.Text = str
        End Sub
    By the way, how do you keep the color in the code tags when you paste on here?
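
    The IndexOf/Substring steps above can also be folded into one small helper. What follows is only a minimal sketch, not the tested code from this post: the name TextBetween is hypothetical, the sample string is modeled on the output quoted in the code comment, and unlike the code above it drops the closing </li> and returns Nothing when a marker is missing instead of throwing.

```vbnet
Imports System

Module TextBetweenSketch
    ' Hypothetical helper (not from the thread): returns the text between
    ' startMarker and endMarker, or Nothing when either marker is missing.
    Function TextBetween(source As String, startMarker As String, endMarker As String) As String
        Dim startPos As Integer = source.IndexOf(startMarker, StringComparison.Ordinal)
        If startPos < 0 Then Return Nothing
        startPos += startMarker.Length

        ' Search for the end marker only after the start marker.
        Dim endPos As Integer = source.IndexOf(endMarker, startPos, StringComparison.Ordinal)
        If endPos < 0 Then Return Nothing

        Return source.Substring(startPos, endPos - startPos).Trim()
    End Function

    Sub Main()
        ' Sample markup (assumed), shaped like the downloaded page source.
        Dim html As String = "<li class=""ReleaseDate""><label>Release date:</label> 6/8/2011</li> <li class=""FileSize"">"
        Console.WriteLine(TextBetween(html, "<label>Release date:</label>", "</li>"))
    End Sub
End Module
```

    Searching for the end marker from startPos onward avoids the fixed 50-character window, so a shorter page tail should not raise an ArgumentOutOfRangeException.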
     
  5. Muerto

    #5 Muerto, Oct 14, 2012
    Last edited: Jan 12, 2021
    (OP)
    ...
     
  6. Calistoga

    Calistoga MDL Senior Member

    Jul 25, 2009
    421
    199
    10
    #6 Calistoga, Oct 14, 2012
    Last edited by a moderator: Apr 20, 2017
    Have you considered using regular expressions?

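    The following expression matches every date in the HTML source that is written with a two-digit month and day (for example 12/16/2010). A date such as 6/8/2011 needs \d{1,2}/\d{1,2}/\d{4} instead: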
    Code:
    \d{2}/\d{2}/\d{4}
    Test it with [C#]:
    Code:
    StringCollection resultList = new StringCollection();
    
    try
    {
        Regex regexObj = new Regex(@"\d{2}/\d{2}/\d{4}", RegexOptions.Multiline);
        Match matchResult = regexObj.Match(subjectString);
    
        while (matchResult.Success)
        {
            resultList.Add(matchResult.Value);
            matchResult = matchResult.NextMatch();
        }
    }
    catch (ArgumentException ex)
    {
        // Syntax error in the regular expression
    }
    
    [VB.NET]:
    Code:
    Dim ResultList As StringCollection = New StringCollection()
    
    Try
        Dim RegexObj As New Regex("\d{2}/\d{2}/\d{4}", RegexOptions.Multiline)
        Dim MatchResult As Match = RegexObj.Match(SubjectString)
    
        While MatchResult.Success
            ResultList.Add(MatchResult.Value)
            MatchResult = MatchResult.NextMatch()
        End While
    Catch ex As ArgumentException
        'Syntax error in the regular expression
    End Try
    
    You can create more complex expressions depending on how you want to solve the problem. If you know that the string you're checking contains one, and only one date, then the following will work fine to retrieve it [C#]:
    Code:
    string resultString = null;
    
    try
    {
        resultString = Regex.Match(subjectString, @"\d{2}/\d{2}/\d{4}", RegexOptions.Multiline).Value;
    }
    catch (ArgumentException ex)
    {
        // Syntax error in the regular expression
    }
    
    [VB.NET]:
    Code:
    Dim ResultString As String
    
    Try
        ResultString = Regex.Match(SubjectString, "\d{2}/\d{2}/\d{4}", RegexOptions.Multiline).Value
    Catch ex As ArgumentException
        'Syntax error in the regular expression
    End Try
    
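    And don't forget [C#]:
    Code:
    using System.Collections.Specialized; // for StringCollection
    using System.Text.RegularExpressions;
    
    [VB.NET]:
    Code:
    Imports System.Collections.Specialized ' for StringCollection
    Imports System.Text.RegularExpressions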
    
    You can skip the try/catch blocks if you want, but they might be useful if you change the expressions often.
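
    If the page holds more than one date, the expression can be tied to the label instead. This is a rough sketch under a few assumptions: the function name FindReleaseDate is hypothetical, the markup is assumed to look like the <li class="ReleaseDate"> snippet quoted earlier in the thread, and \d{1,2} is used so that single-digit months and days such as 6/8/2011 also match.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ReleaseDateRegexSketch
    ' Hypothetical helper: captures the date that follows the "Release date:" label.
    ' Returns Nothing when the label or a date after it is not found.
    Function FindReleaseDate(html As String) As String
        Dim m As Match = Regex.Match(html, "Release date:</label>\s*(\d{1,2}/\d{1,2}/\d{4})")
        Return If(m.Success, m.Groups(1).Value, Nothing)
    End Function

    Sub Main()
        ' Both sample strings are assumptions modeled on markup quoted in the thread.
        Console.WriteLine(FindReleaseDate("<li class=""ReleaseDate""><label>Release date:</label> 6/8/2011</li>"))
        Console.WriteLine(FindReleaseDate("<li class=""ReleaseDate""><label>Release date:</label> 12/16/2010</li>"))
    End Sub
End Module
```

    Group 1 holds only the date, so no extra trimming of the label or the closing tag should be needed.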
     
  7. Muerto

    #7 Muerto, Oct 15, 2012
    Last edited: Jan 12, 2021
    (OP)
    ...
     
  8. Calistoga

    I hear ya. However, as long as the data is formatted the same way everywhere, a regular expression will always work.

    Alternatively, you could try the HtmlAgilityPack.
    Let us know how you did it when it's done!
     
  9. Alphawaves

    Alphawaves Super Moderator/Developer
    Staff Member

    Aug 11, 2008
    6,222
    22,280
    210
    #9 Alphawaves, Oct 15, 2012
    Last edited by a moderator: Apr 20, 2017
    I'm not on a PC with C# or VB, but I should think you could add this after your line (rd = src.Substring(src.IndexOf(srchString) + srchString.Length, 50)):
    Code:
    while (rd.StartsWith(" "))
    {
        rd = rd.Substring(1);
    }
    while (rd.Contains("<"))
    {
        rd = rd.Substring(0, rd.Length - 1);
    }
    I have some other methods of getting strings from URLs; I will add more later. I'm messing around in Windows 8 at the moment.
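
    Since the code being patched is VB.NET, here is a possible VB.NET rendering of the two loops above. It is only a sketch: the wrapper name StripAroundDate is hypothetical, and the sample input is modeled on the Substring output quoted earlier in the thread.

```vbnet
Imports System

Module TrimLoopsSketch
    ' VB.NET rendering of the two C#-style loops, wrapped in a hypothetical function.
    Function StripAroundDate(rd As String) As String
        ' Drop leading spaces one character at a time.
        While rd.StartsWith(" ")
            rd = rd.Substring(1)
        End While
        ' Drop trailing characters until no "<" is left in the string.
        While rd.Contains("<")
            rd = rd.Substring(0, rd.Length - 1)
        End While
        Return rd
    End Function

    Sub Main()
        ' Sample input (assumed): a leading space, the date, then trailing markup.
        Console.WriteLine(StripAroundDate(" 6/8/2011</li> <li class=""FileSize"">"))
    End Sub
End Module
```

    The second loop stops as soon as the first "<" has been cut off, which leaves just the text in front of the first tag.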
     
  10. Muerto

    #10 Muerto, Oct 15, 2012
    Last edited: Jan 12, 2021
    (OP)
    ...