Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add specialized FrozenSet implementations for byte and char #110842

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

MihaZupan
Copy link
Member

@MihaZupan MihaZupan commented Dec 19, 2024

Fixes #110783

Adds 3 new FrozenSet imlpementations:

  • ByteFrozenSet - any set of bytes when using the default comparer
  • Latin1CharFrozenSet - set of chars <= 255 when using the default comparer
  • PerfectHashCharFrozenSet - any set of chars when using the default comparer

I extracted the lookup logic from ProbabilisticMapState in SearchValues to a searate helper so that there's practically no code duplication between the two.


Benchmark
public class Bench
{
    private static string Chars() => string.Concat(Enumerable.Range(0, 65536).Select(i => (char)i).Where(c => char.IsAscii(c) && char.IsSymbol(c)));
    private static readonly SearchValues<char> s_sv = SearchValues.Create(Chars());
    private static readonly FrozenSet<char> s_fs = FrozenSet.Create<char>(Chars());
    private static readonly string s_text = File.ReadAllText(@"C:\Users\mihaz\Downloads\gutenberg.org_files_1661_1661-0.txt");

    [Benchmark]
    public int CountSearchValues()
    {
        int count = 0;
        foreach (char c in s_text)
        {
            if (s_sv.Contains(c)) count++;
        }
        return count;
    }

    [Benchmark]
    public int CountFrozenSet()
    {
        int count = 0;
        foreach (char c in s_text)
        {
            if (s_fs.Contains(c)) count++;
        }
        return count;
    }
}

char.IsSymbol (Latin1CharFrozenSet)

Method Toolchain Mean Error Ratio Code Size
CountSearchValues 229.9 us 3.19 us 114 B
CountFrozenSet main 585.6 us 5.93 us 1.00 210 B
CountFrozenSet pr 228.7 us 2.73 us 0.39 108 B

char.IsAscii && char.IsSymbol (PerfectHashCharFrozenSet)

Method Toolchain Mean Error Ratio Code Size
CountSearchValues 265.4 us 3.34 us 102 B
CountFrozenSet main 1,508.3 us 20.19 us 1.00 141 B
CountFrozenSet pr 263.6 us 3.53 us 0.17 102 B

char.IsSymbol for bytes (ByteFrozenSet)

Method Toolchain Mean Error Ratio Code Size
CountSearchValues 193.5 us 2.60 us 74 B
CountFrozenSet main 1,529.5 us 22.44 us 1.00 131 B
CountFrozenSet pr 193.0 us 2.33 us 0.13 74 B

@MihaZupan MihaZupan added this to the 10.0.0 milestone Dec 19, 2024
@MihaZupan MihaZupan self-assigned this Dec 19, 2024
@MihaZupan
Copy link
Member Author

cc: @dotnet/area-system-collections (the bot was lazy)

@stephentoub
Copy link
Member

Can you point to some places where FrozenSet<byte> and FrozenSet<char> will be used?

@MihaZupan
Copy link
Member Author

Should be similar to SearchValues<byte>/SearchValues<char>, where you'd previously be using e.g. HashSet / string.Contains / bool[128].
Many uses would benefit from using SearchValues's ContainsAny though as they're just doing scalar loops around Contains themselves.

As a random example, here's one in Uri, though we could of course use SearchValues since we know the implementation:

internal static bool IsNotSafeForUnescape(char ch)

My main motivation was making the answer to "what should I use for Contains(T) checks" simply FrozenSet<T>, given that the interesting part of the implementation can be rather easily shared with SearchValues in this case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consider improving FrozenSet for byte/char
4 participants