Remove duplicate lines from text-based files

Discussion in 'Scripting' started by Thomas Dubreuil, Jun 8, 2019.

  1. Thomas Dubreuil

    Thomas Dubreuil MDL Senior Member

    Aug 29, 2017
    304
    504
    10
    #1 Thomas Dubreuil, Jun 8, 2019
    Last edited: Jun 13, 2019
  2. Thomas Dubreuil

    Thomas Dubreuil MDL Senior Member

    I have now made two separate scripts. They are much faster (with the correct regex) and support Unicode encoding.
    (I still don't know what happens with a file without a BOM, but it should work...)

    Unlike simple batch scripts, this is a robust solution: it copes with unusual characters and is relatively fast.
    It is based on the JREPL script by dbenham (thanks to him for the utility, and also for sharing the correct regex syntax).

    https://github.com/Thdub/Batch_Scripts/blob/master/Utilities/RemoveDuplicateLines.bat
    Remove duplicate lines, blank lines and lines containing only white space from any text-based file.
    Keeps the last occurrence of each duplicated line.
    Usage: just drag and drop your file onto this script.
    Supports files with ASCII, UTF-8 and Unicode character encoding.

    https://github.com/Thdub/Batch_Scripts/blob/master/Utilities/Remove_duplicate_and_blank_lines.bat
    Remove duplicate lines from any text-based file, while preserving blank lines.
    Usage: just drag and drop your file onto this script.
    Keeps the last occurrence of each duplicated line.
    Supports files with ASCII, UTF-8 and Unicode character encoding.
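
For readers who want to see the behaviour described above without opening the batch files, here is a minimal Python sketch of "keep the last occurrence, preserve the original order". It is not the JREPL-based implementation; the function name `dedupe_keep_last` and the `drop_blank` switch are made up for illustration, and it works on lines already read into memory, so encoding handling is left out.

```python
def dedupe_keep_last(lines, drop_blank=True):
    """Return lines without duplicates, keeping the LAST occurrence of each."""
    seen = set()
    kept = []
    # Walk backwards: the first time a line is met is its last occurrence.
    for line in reversed(lines):
        if not line.strip():
            # Blank or whitespace-only line: drop it, or keep it untouched.
            if not drop_blank:
                kept.append(line)
            continue
        if line not in seen:
            seen.add(line)
            kept.append(line)
    kept.reverse()  # restore the original top-to-bottom order
    return kept


sample = ["b", "a", "", "b", "  ", "c", "a"]
print(dedupe_keep_last(sample))                    # ['b', 'c', 'a']
print(dedupe_keep_last(sample, drop_blank=False))  # ['', 'b', '  ', 'c', 'a']
```

With `drop_blank=True` the sketch mirrors the variant that also strips blank lines; with `drop_blank=False` it mirrors the variant that leaves them in place.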
     
  3. ohenry

    ohenry MDL Novice

    Aug 10, 2009
    39
    20
    0
    Not to rain on your parade or anything, but why not just use "uniq"? If you're running Windows, you really should have the Cygwin utilities anyway.
     
  4. Thomas Dubreuil

    Thomas Dubreuil MDL Senior Member

    #4 Thomas Dubreuil, Jun 16, 2019
    Last edited: Jul 7, 2019
    (OP)
    @ohenry Because uniq only removes adjacent duplicate lines, so it only works on sorted files/lists ;)
    If your lines are not ordered, you have to run "sort" first, which means losing the original line order.
    These scripts, by contrast, keep the original line order (or disorder, if you prefer).
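
To make the difference concrete, here is a small Python sketch that imitates the three behaviours on an in-memory list. The function names are illustrative only; `uniq_adjacent` approximates what `uniq` does, not its actual implementation.

```python
from itertools import groupby


def uniq_adjacent(lines):
    """Like `uniq`: collapse only runs of identical neighbouring lines."""
    return [key for key, _run in groupby(lines)]


def sort_then_uniq(lines):
    """Like `sort | uniq`: every duplicate goes, but so does the original order."""
    return uniq_adjacent(sorted(lines))


def keep_last_in_order(lines):
    """Order-preserving dedupe that keeps the last occurrence of each line."""
    return list(dict.fromkeys(reversed(lines)))[::-1]


lines = ["pear", "apple", "pear", "pear", "fig", "apple"]
print(uniq_adjacent(lines))       # ['pear', 'apple', 'pear', 'fig', 'apple']
print(sort_then_uniq(lines))      # ['apple', 'fig', 'pear']
print(keep_last_in_order(lines))  # ['pear', 'fig', 'apple']
```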
     
  5. rayleigh_otter

    rayleigh_otter MDL Expert

    Aug 8, 2018
    1,121
    905
    60
    Thomas, your remove duplicate and blank lines bat, just drag a file onto it and it deletes duplicates?
    Bloody hell :eek:, talk about great timing. Pray it works. :)
     
  6. rayleigh_otter

    rayleigh_otter MDL Expert

    I've got the outputs of 6 different privacy tools that I need to separate into HKLM and HKCU, then check for duplicates. ;)
     
  7. Thomas Dubreuil

    Thomas Dubreuil MDL Senior Member

    #7 Thomas Dubreuil, Jun 16, 2019
    Last edited: Jun 16, 2019
    (OP)
    That's the idea: simple drag and drop.
    It works with most kinds of encoding and strange characters, and preserves line ordering while keeping the last occurrence of each duplicated line... works fine with .reg ;)

    PS: We can also do that in Notepad++ with a regex (regular expression) search: search for ^(.*?)$\s+?^(?=.*^\1$) and replace with nothing. For me, though, drag and drop is sometimes simpler.
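
The same pattern can be tried outside Notepad++. The sketch below runs it through Python's `re` module; this is an assumption-laden translation, not the Notepad++ engine. In particular it assumes that `.` is allowed to match newlines (so the lookahead can scan the following lines), which is why `re.DOTALL` is set alongside `re.MULTILINE`. The function name is made up for the example.

```python
import re

# Pattern copied verbatim from the post above.
DUPLICATE_LINE = re.compile(r"^(.*?)$\s+?^(?=.*^\1$)", re.MULTILINE | re.DOTALL)


def remove_earlier_duplicates(text):
    """Delete every line that appears again further down, so the last copy stays."""
    return DUPLICATE_LINE.sub("", text)


text = "a\nb\na\nc\nb\n"
print(repr(remove_earlier_duplicates(text)))  # 'a\nc\nb\n'
```

The lookahead rescans the rest of the text for every line, so this is only practical on small inputs.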

    What is cool with the Notepad++ search is that you can also "mark" lines, then cut/copy/paste the marked lines...
    Also, the Compare plugin for Notepad++ is a must-have (be sure to update to the latest v2 plugin version, because the vertical scrolling "lock" was broken by a plugin incompatibility with the latest Notepad++).
     
  8. rayleigh_otter

    rayleigh_otter MDL Expert

    I use Notepad2-mod; I've never had to do this before. Saw this thread a day or two ago, glad I bookmarked it. :)
     
  9. rayleigh_otter

    rayleigh_otter MDL Expert

    Created a copy of a file, ran your bat on it, compared both files' properties: it does something alright :) :worthy:
     
  10. BAU

    BAU MDL Senior Member

    Feb 10, 2009
    457
    766
    10
    Just found these old but still useful scripts for this task. To use them, drag & drop a folder containing all the .reg files onto regmerge.bat, then drag the merged .reg file it produces onto regsort.bat.
    regmerge.bat
    Code:
    :: REGMERGE by AveYo
    :: my script bad, badder than yours - Some Rights Reserved 2011 AveYo
    ::
    :: FEATURES:
    ::  Merges an input folder containing *.reg files into a single input.reg
    ::  Combine tweaks, the easy way.
    ::
    :: SCRIPT NOTE:
    ::  Should work on a default, working Windows 7
    :: - no need for admin rights, sorting will be done without importing into live OS
    :: - no retard prompts, it will timeout self if no action from user
    ::
    :: MAP:
    ::  input folder with *.reg --start--> --combine--> input.reg
    ::
    :: Let's begin...
    ::
    :: Start processing
    @ECHO OFF &SETLOCAL ENABLEEXTENSIONS ENABLEDELAYEDEXPANSION
    SET _MYVER=1.0.2
    PUSHD "%~dp0"
    
    IF NOT EXIST "%~1\*.reg" echo. &echo  INPUT MISSING - Usage: REGMERGE foldername containing *.reg &echo  Drag and drop a folder into REGMERGE.bat &PING localhost >nul 2>&1 &EXIT /B
    :: start with countdown so you can abort
    FOR /L %%I IN (10,-1,1) DO (
    CLS &echo. &echo  INFO: Merging %~n1\*.reg. &echo  Starting in %%I seconds. Press [Ctrl+C] to cancel... &SET "LOGO=v%_MYVER% b^y ^Av^e^Y^o^"
    PING -n 2 127.0.0.1>nul 2>&1
    )
    :: count files (bare listing, so the count does not depend on the locale's date format)
    PUSHD "%~1"
    FOR /F %%I IN ('dir /b /a-d "*.reg" ^| FIND /C /V ""') DO SET "FILECOUNT=%%I" &TITLE REGMERGE %LOGO% &COLOR F0
    PUSHD ..
    SET "OUTFILE=%CD%\%~n1.reg"
    POPD
    
    :: header
    ECHO/Windows Registry Editor Version 5.00>"%OUTFILE%" &ECHO/>>"%OUTFILE%"
    :: merging every reg file
    FOR %%A IN (*.reg) DO (
    CLS &echo. &echo  Please wait while merging !FILECOUNT! registry files...
    FOR /F "TOKENS=1* DELIMS=]" %%H IN ('TYPE "%%A" ^| FIND /V /N ""') DO (
    SET "STR=%%I"
    CALL :TRIM_STR
    SET "REGSTR=!STR!"
    SET "REGVALUE=Y"
    IF /I "!REGSTR:~0,36!"=="Windows Registry Editor Version 5.00" SET "REGVALUE=" &&ECHO/>>"%OUTFILE%" &ECHO/;REGMERGE %%A>>"%OUTFILE%" &&ECHO/;============================================================>>"%OUTFILE%"
    IF DEFINED REGVALUE ECHO/!REGSTR!>>"%OUTFILE%"
    )
    SET /A "FILECOUNT-=1"
    )
    :: Job Done!
    CLS &echo. &echo  Done, merged to "%OUTFILE%"
    :: Cleanup
    POPD &TITLE CMD & COLOR 0F
    PING localhost >nul 2>&1 &EXIT /B
    
    
    
    ::
    :: Internal functions
    ::
    :TRIM_STR
    :: Usage: CALL :TRIM_STR
    :: Description: Remove preceding and ending spaces from STR var
    IF "!STR:~-1!"==" " SET "STR=!STR:~0,-1!" & GOTO :TRIM_STR
    IF "!STR:~0,1!"==" " SET "STR=!STR:~1,-1!!STR:~-1!" & GOTO :TRIM_STR
    GOTO :eof
    ::END.TRIM_STR
    ::
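
The merge that regmerge.bat performs can also be sketched in Python, which may be easier to follow than the batch loops. This is a rough approximation under stated assumptions, not a port: it takes already-decoded text instead of reading files, and the names `HEADER` and `merge_reg_texts` are invented for the example.

```python
HEADER = "Windows Registry Editor Version 5.00"


def merge_reg_texts(named_texts):
    """Merge (file_name, file_content) pairs into one .reg text with one header."""
    out = [HEADER, ""]
    for name, text in named_texts:
        for raw in text.splitlines():
            line = raw.strip(" ")  # trim leading/trailing spaces, like :TRIM_STR
            if line[:len(HEADER)].lower() == HEADER.lower():
                # Each per-file header is replaced by a marker comment.
                out += ["", f";REGMERGE {name}", ";" + "=" * 60]
            else:
                out.append(line)
    return "\n".join(out) + "\n"


a = HEADER + '\n\n[HKEY_CURRENT_USER\\Software\\Demo]\n"One"=dword:00000001\n'
b = HEADER + '\n\n[HKEY_CURRENT_USER\\Software\\Demo]\n"Two"="x"\n'
print(merge_reg_texts([("a.reg", a), ("b.reg", b)]))
```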
    
    
    regsort.bat
    Code:
    :: REGSORT by AveYo
    :: my script bad, badder than yours - Some Rights Reserved 2011 AveYo
    ::
    :: FEATURES:
    ::  Many registry tweaks from the web have duplicate and conflicting entries and are in general a mess.
    ::  This batch file will try to sort out entries and remove duplicates.
    ::  Unfortunately outside comments cannot be preserved (but who needs them when entries tend to be self-explanatory).
    ::
    :: SCRIPT NOTE:
    ::  Should work on a default, working Windows 7
    :: - no need for admin rights, sorting will be done without importing into live OS
    :: - no retard prompts, it will timeout self if no action from user
    ::
    :: MAP:
    ::  input.reg --start--> %Temp%\input_sorted\source.r1 --fixdelayedexpansion--> source.r2 --split--> *.r --sort--> *.r1
    :: --combine--> output.r1 --unfixdelayedexpansion--> output.r2 --done--> input_sorted.reg
    ::
    :: Let's begin...
    ::
    :: Start processing
    @ECHO OFF &SETLOCAL ENABLEEXTENSIONS
    SET _MYVER=1.0.21
    PUSHD "%~dp0"
    IF NOT EXIST "%~1" echo. &echo  INPUT MISSING - Usage: REGSORT filename.reg &echo  Drag and drop a .reg file into REGSORT &PING localhost >nul 2>&1 &EXIT /B
    IF /I "%~x1"==".REG" (CLS) ELSE echo. &echo  INVALID INPUT - Usage: REGSORT filename.reg &echo  drag and drop a .reg file into REGSORT &PING localhost >nul 2>&1 &EXIT /B
    FOR /L %%I IN (10,-1,1) DO (
    CLS &echo. &echo  INFO: Sorting %~nx1 will require many temporary files so it might be slow. &echo  Starting in %%I seconds. Press [Ctrl+C] to cancel...
    PING -n 2 127.0.0.1>nul 2>&1
    )
    CLS &echo.
    :: Setup paths
    SET "FPATH=%~dp1" &SET "FNAME=%~n1"
    SET "FPATH=%FPATH:~0,-1%"
    RD /S /Q "%TEMP%\%FNAME%_sorted\" >nul 2>&1 &TITLE %~n0 %FNAME%.reg by AveYo &COLOR F0
    MD "%TEMP%\%FNAME%_sorted"
    IF NOT EXIST "%TEMP%\%FNAME%_sorted\*" echo  ERROR! temporary folder unavailable, please retry &PING localhost >nul 2>&1 &EXIT /B
    COPY /Y "%FPATH%\%FNAME%.reg" "%TEMP%\%FNAME%_sorted\source.r1" >nul 2>&1
    PUSHD "%TEMP%\%FNAME%_sorted"
    :: header
    ECHO/Windows Registry Editor Version 5.00>output.r1 &ECHO/>>output.r1
    ::
    :: Safeguard for batch-troublesome characters.
    :: Using Delayedexpansion does a great job preserving special characters except [!]
    :: and working with strings containing [&] breaks batch processing causing loss of data
    :: so temporarily use character substitution - that needs to be reverted in the final step.
    CLS &echo. &echo  Please wait while safeguarding batch-troublesome characters...
    CALL :SAFEGUARD_CHARS source.r1 source.r2
    ::
    :: Split hives
    FOR /F %%I IN ('TYPE "source.r2" ^| FIND /C "[H"') DO SET "HIVECOUNT=%%I"
    CLS &echo. &echo  Please wait while splitting %HIVECOUNT% registry hives...
    SET "REGFILE=_HIVES"
    SETLOCAL ENABLEDELAYEDEXPANSION
    FOR /F "TOKENS=1* DELIMS=]" %%H IN ('TYPE "source.r2" ^| FIND /V /N ""') DO (
    SET "STR=%%I"
    CALL :TRIM_STR
    SET "REGSTR=!STR!"
    SET "REGKEYDEL="
    SET "REGKEY="
    SET "REGVALUE=Y"
    rem FOR /F "TOKENS=* DELIMS= " %%S IN ("!REGSTR!") DO SET "REGSTR=%%S"
    IF "!REGSTR:~-2!"=="\]" SET "REGSTR=!REGSTR:~0,-2!]"
    IF "!REGSTR!"=="" SET "REGVALUE="
    IF "!REGSTR:~0,1!"==";" SET "REGVALUE="
    IF /I "!REGSTR:~0,36!"=="Windows Registry Editor Version 5.00" SET "REGVALUE="
    IF /I "!REGSTR:~0,4!"=="[-HK" CALL :STRIPSYMBOLS_INTO_STR_VAR !REGSTR! &SET "REGFILE=!STR!" &SET "REGKEYDEL=Y" &SET "REGVALUE="
    IF /I "!REGSTR:~0,3!"=="[HK" CALL :STRIPSYMBOLS_INTO_STR_VAR !REGSTR!  &SET "REGFILE=!STR!" &SET "REGKEY=Y" &SET "REGVALUE="
    IF EXIST "Z_DELETED_HIVES.r" (
    IF DEFINED REGKEYDEL FIND /I "!REGSTR!" < "Z_DELETED_HIVES.r" >NUL 2>&1 &IF ERRORLEVEL 1 ECHO/!REGSTR!>> "Z_DELETED_HIVES.r"
    ) ELSE (
    IF DEFINED REGKEYDEL ECHO/!REGSTR!>> "Z_DELETED_HIVES.r"
    )
    IF EXIST "!REGFILE!.r" (
    IF DEFINED REGKEY FIND /I "!REGSTR!" < "!REGFILE!.r" >NUL 2>&1 &IF ERRORLEVEL 1 ECHO/!REGSTR!>> "!REGFILE!.r"
    ) ELSE (
    IF DEFINED REGKEY ECHO/!REGSTR!>> "!REGFILE!.r"
    )
    IF EXIST "!REGFILE!.r" IF DEFINED REGVALUE ECHO/!REGSTR!>> "!REGFILE!.r"
    )
    ENDLOCAL
    ::
    :: Sort data
    SET HIVECOUNT=0
    FOR %%C IN (*.r) DO SET /A HIVECOUNT+=1
    CLS &echo. &echo  Please wait while sorting %HIVECOUNT% registry keys...
    :: Filter out sortable hives (not containing multiline data)
    FOR %%A IN (*.r) DO (
    FINDSTR /M /E /L /I /C:,\ "%%A" >NUL 2>&1
    IF ERRORLEVEL 1 SORT /REC 65535 /R "%%A" /O "%%A0" &DEL /F /Q  "%%A"
    )
    :: Remove duplicate lines from sortable hives
    SETLOCAL ENABLEDELAYEDEXPANSION
    FOR %%A IN (*.r0) DO (
    SET "STR="
    FOR /F "TOKENS=1* DELIMS=]" %%H IN ('TYPE "%%A" ^| FIND /V /N ""') DO (
    SET "_KEEPLINE=Y"
    IF "!STR!"=="%%I" SET "_KEEPLINE="
    SET "STR=%%I"
    IF DEFINED _KEEPLINE ECHO/%%I >> "%%~nA.r"
    )
    )
    ENDLOCAL
    ::
    :: Sort keys
    FOR /F %%A IN ('DIR /B /O:-NE *.r') DO (
    ECHO/>> "%%A"
    COPY /Y /B output.r1 + "%%A" output.r1 >NUL 2>&1
    )
    ::
    :: Revert safeguarding for batch-troublesome characters.
    CLS &echo. &echo  Please wait while reverting safeguarding batch-troublesome characters...
    CALL :UNSAFEGUARD_CHARS output.r1 output.r2
    ::
    :: Job Done!
    COPY /Y "output.r2" "%FPATH%\%FNAME%_sorted.reg" >nul 2>&1
    CLS &echo. &echo  Done, output is in  "%FPATH%\%FNAME%_sorted.reg"
    :: Cleanup
    rem POPD &RD /S /Q "%TEMP%\%FNAME%_sorted"  &TITLE CMD & COLOR 0F
    PING localhost >nul 2>&1 &EXIT /B
    
    
    
    ::
    :: Internal functions
    ::
    :TRIM_STR
    :: Usage: CALL :TRIM_STR
    :: Description: Remove preceding and ending spaces from STR var
    IF "!STR:~-1!"==" " SET "STR=!STR:~0,-1!" & GOTO :TRIM_STR
    IF "!STR:~0,1!"==" " SET "STR=!STR:~1,-1!!STR:~-1!" & GOTO :TRIM_STR
    GOTO :eof
    ::END.TRIM_STR
    ::
    :SAFEGUARD_CHARS
    :: Usage: CALL :SAFEGUARD_CHARS source.file output.file
    :: Description: substitute [&] and [!] to prevent data loss - work around delayedexpansion limitation
    FOR /F "TOKENS=1* DELIMS=]" %%H IN ('TYPE "%1" ^| FIND /V /N ""') DO (
    SET "STR=%%I"
    IF DEFINED STR (CALL :SAFEGUARD_CHARS_SUBST %2) ELSE (CALL :SAFEGUARD_CHARS_EMPTY %2)
    )
    GOTO :eof
    :SAFEGUARD_CHARS_SUBST
    SET "STR=%STR:&=¬oO¬%"
    SET "STR=%STR:!=¬_.¬%"
    SETLOCAL ENABLEDELAYEDEXPANSION
    ECHO/!STR!>> "%1"
    ENDLOCAL
    GOTO :eof
    :SAFEGUARD_CHARS_EMPTY
    ECHO/>> "%1"
    GOTO :eof
    ::END.SAFEGUARD_CHARS
    ::
    :UNSAFEGUARD_CHARS
    :: Usage: CALL :UNSAFEGUARD_CHARS source.file output.file
    :: Description: revert substitute [&] and [!] to prevent data loss - work around delayedexpansion limitation
    FOR /F "TOKENS=1* DELIMS=]" %%H IN ('TYPE "%1" ^| FIND /V /N ""') DO (
    SET "STR=%%I"
    IF DEFINED STR (CALL :UNSAFEGUARD_CHARS_SUBST %2) ELSE (CALL :UNSAFEGUARD_CHARS_EMPTY %2)
    )
    GOTO :eof
    :UNSAFEGUARD_CHARS_SUBST
    SET STR=%STR:¬oO¬=&%
    SET STR=%STR:¬_.¬=!%
    SETLOCAL ENABLEDELAYEDEXPANSION
    ECHO/!STR!>> "%1"
    ENDLOCAL
    GOTO :eof
    :UNSAFEGUARD_CHARS_EMPTY
    ECHO/>> "%1"
    GOTO :eof
    ::END.UNSAFEGUARD_CHARS
    ::
    :STRIPSYMBOLS_INTO_STR_VAR
    :: Usage: CALL :STRIPSYMBOLS_INTO_STR_VAR INPUT
    :: Description: Remove invalid in filename \ / : * ? " < > | and batch symbols ^ & ' ` @ { } [ ]  , $  =  ! - # ( ) % . + ~
    :: from INPUT into hardcoded STR variable. Adapted from _getname by Theoutcaste @ http://forums.techguy.org
    SET "STR=%*"
    SETLOCAL DISABLEDELAYEDEXPANSION
    SET "STR=%STR:!=%"
    ENDLOCAL &SET "STR=%STR%"
    SET "STR=%STR:"=%"
    SETLOCAL ENABLEDELAYEDEXPANSION
    FOR %%I IN ( \ / : ^< ^> ^| ^^ ^& ' ` @ { } [ ] $  - # ^( ^) ^%% . + ) DO SET "STR=!STR:%%I=!"
    :_STRIPSYMBOLS_SUB1
    SET "STR1="
    FOR /F "TOKENS=1* DELIMS=*?,;=~" %%J IN ("%STR%") DO SET "STR=%%J%%K" &SET "STR1=%%J" &SET "STR2=%%K"
    IF NOT "%STR2%"=="" GOTO :_STRIPSYMBOLS_SUB1
    IF NOT DEFINED STR1 SET STR=ERROR_INVALID_NAME
    ENDLOCAL &SET "STR=%STR%"
    :: trim and replace space with _
    FOR /F "TOKENS=* DELIMS= " %%S IN ("%STR%") DO SET STR=%%S
    SET "STR=%STR: =_%"
    GOTO :EOF
    ::END.STRIPSYMBOLS_INTO_STR_VAR
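
The overall idea behind regsort.bat (group values under their key, drop comments, remove identical lines, emit keys in a fixed order) can be sketched in Python as follows. This is a simplified illustration, not a port of the batch file: it assumes single-line values only, sorts in ascending order rather than the reverse order the batch uses, ignores values that follow a key-deletion line, and the function name `sort_reg` is made up.

```python
HEADER = "Windows Registry Editor Version 5.00"


def sort_reg(text):
    """Group values by key, drop comments and identical lines, sort keys and values."""
    deleted = []   # "[-HKEY...]" key-deletion lines, in first-seen order
    keys = {}      # "[HKEY...]" header line -> list of value lines
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";") or line.lower().startswith(HEADER.lower()):
            continue  # blank lines, comments and headers are not kept
        if line.startswith("[-"):
            if line not in deleted:
                deleted.append(line)
            current = None
        elif line.startswith("["):
            current = keys.setdefault(line, [])
        elif current is not None:
            current.append(line)
    out = [HEADER, ""]
    if deleted:
        out += deleted + [""]
    for key in sorted(keys):
        out.append(key)
        out += sorted(set(keys[key]))  # identical value lines collapse here
        out.append("")
    return "\n".join(out)


demo = (HEADER + "\n\n"
        '[HKEY_CURRENT_USER\\B]\n"x"=dword:00000001\n"x"=dword:00000001\n; note\n'
        '[HKEY_CURRENT_USER\\A]\n"z"="1"\n"y"="2"\n'
        '[HKEY_CURRENT_USER\\B]\n"w"="3"\n')
print(sort_reg(demo))
```

As in the batch file, only identical lines are merged; two different values for the same name both survive and still have to be resolved by hand.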
    
    Beats doing it by hand next time, even with SynWrite / CudaText packed with plugins, great keyboard shortcuts and multi-cursor editing.
    I don't think I ever shared them publicly before; I circulated them among friends and coworkers for more than a decade, then forgot about them for a couple more years :)