
Text classification : Target 4294967295 is out of bounds (CPU) #2369

Closed
AlbelTec opened this issue Nov 12, 2022 · 18 comments
Labels: Priority:0 Work that we can't release without · Reported by: Customer

Comments

AlbelTec commented Nov 12, 2022

To be linked to: #2369 (comment)

System Information (please complete the following information):

  • Model Builder Version (available in Manage Extensions dialog): 16.14.0.2255902
  • Visual Studio Version: 2022

Describe the bug

  • On which step of the process did you run into an issue: when starting the train step
  • Clear description of the problem: when starting the trainer, I'm getting this error

To Reproduce
Steps to reproduce the behavior:

  1. Go to 'Train step'
  2. Click on 'Start training'
  3. See error in pop up window

Expected behavior
Training should start and complete without this error; an explanation of what is causing it would also help.

Screenshots

Target 4294967295 is out of bounds.
Exception raised from nll_loss_out_frame at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\LossNLL.cpp:230 (most recent call first):
00007FFF91B7A4C200007FFF91B7A460 c10.dll!c10::Error::Error [<unknown file> @ <unknown line number>]
00007FFF91B53ED500007FFF91B53E60 c10.dll!c10::IndexError::IndexError [<unknown file> @ <unknown line number>]
00007FFEC4151FB400007FFEC414CE60 torch_cpu.dll!at::native::multi_margin_loss_cpu_out [<unknown file> @ <unknown line number>]
00007FFEC41558F300007FFEC414CE60 torch_cpu.dll!at::native::multi_margin_loss_cpu_out [<unknown file> @ <unknown line number>]
00007FFEC415773400007FFEC41576D0 torch_cpu.dll!at::native::structured_nll_loss_forward_out_cpu::impl [<unknown file> @ <unknown line number>]
00007FFEC499C3DE00007FFEC498B710 torch_cpu.dll!at::cpu::zero_ [<unknown file> @ <unknown line number>]
00007FFEC49619AE00007FFEC4936730 torch_cpu.dll!at::cpu::bucketize_outf [<unknown file> @ <unknown line number>]
00007FFEC459B89000007FFEC45474F0 torch_cpu.dll!at::_ops::zeros_out::redispatch [<unknown file> @ <unknown line number>]
00007FFEC472308300007FFEC4722FE0 torch_cpu.dll!at::_ops::nll_loss_forward::redispatch [<unknown file> @ <unknown line number>]
00007FFEC54BAFA300007FFEC533A050 torch_cpu.dll!torch::autograd::GraphRoot::apply [<unknown file> @ <unknown line number>]
00007FFEC54860F200007FFEC533A050 torch_cpu.dll!torch::autograd::GraphRoot::apply [<unknown file> @ <unknown line number>]
00007FFEC46D698C00007FFEC46D6800 torch_cpu.dll!at::_ops::nll_loss_forward::call [<unknown file> @ <unknown line number>]
00007FFEC4157F0F00007FFEC4157E90 torch_cpu.dll!at::native::nll_loss [<unknown file> @ <unknown line number>]
00007FFEC4B1B6B200007FFEC4B17680 torch_cpu.dll!at::compositeimplicitautograd::where [<unknown file> @ <unknown line number>]
00007FFEC4AFAA5D00007FFEC4ACFD00 torch_cpu.dll!at::compositeimplicitautograd::broadcast_to [<unknown file> @ <unknown line number>]
00007FFEC47A0C6F00007FFEC47A0AE0 torch_cpu.dll!at::_ops::nll_loss::call [<unknown file> @ <unknown line number>]
00007FFEC415888F00007FFEC4157F80 torch_cpu.dll!at::native::nll_loss_nd [<unknown file> @ <unknown line number>]
00007FFEC4B1B6E200007FFEC4B17680 torch_cpu.dll!at::compositeimplicitautograd::where [<unknown file> @ <unknown line number>]
00007FFEC4AFAACD00007FFEC4ACFD00 torch_cpu.dll!at::compositeimplicitautograd::broadcast_to [<unknown file> @ <unknown line number>]
00007FFEC45E142F00007FFEC45E12A0 torch_cpu.dll!at::_ops::nll_loss_nd::call [<unknown file> @ <unknown line number>]
00007FFEC415653F00007FFEC4156250 torch_cpu.dll!at::native::cross_entropy_loss [<unknown file> @ <unknown line number>]
00007FFEC4B1968100007FFEC4B17680 torch_cpu.dll!at::compositeimplicitautograd::where [<unknown file> @ <unknown line number>]
00007FFEC4AFAB5200007FFEC4ACFD00 torch_cpu.dll!at::compositeimplicitautograd::broadcast_to [<unknown file> @ <unknown line number>]
00007FFEC4786B2300007FFEC4786980 torch_cpu.dll!at::_ops::cross_entropy_loss::call [<unknown file> @ <unknown line number>]
00007FFEC3E7FC7100007FFEC3E7FC40 torch_cpu.dll!at::cross_entropy_loss [<unknown file> @ <unknown line number>]
00007FFF30895E0500007FFF30895C60 LibTorchSharp.DLL!THSNN_cross_entropy [<unknown file> @ <unknown line number>]
00007FFF39B3F754 <unknown symbol address> !<unknown symbol> [<unknown file> @ <unknown line number>]


AlbelTec changed the title from "Text classification : error when starting training" to "Text classification : error when starting CPU training" on Nov 12, 2022
beccamc (Contributor) commented Nov 14, 2022

It looks like we've gotten two of these - #2368. @LittleLittleCloud any ideas about possible root cause?

beccamc (Contributor) commented Nov 14, 2022

We seem to be hitting the max uint value. @AlbelTec Can you give more details about your dataset? What is its size?

AlbelTec (Author) commented Nov 16, 2022

Hi @beccamc, sorry for the delay, I was off. The dataset is very simple: financial emails (text & label) for a multiclass classification purpose. It contains 2588 rows of text and label, and each text can contain more than 2000 characters.

beccamc (Contributor) commented Nov 16, 2022

@JakeRadMSFT Thoughts? This doesn't sound like a dataset large enough to be causing the problem.

@AlbelTec (Author)

@beccamc I ran a small test: I tried again with a very small dataset (16 rows) and I'm not getting the error. I'm wondering if long texts (more than 2000 characters) could be raising the issue.

beccamc (Contributor) commented Nov 16, 2022

@AlbelTec Are you able to share your dataset?

@AlbelTec (Author)

@beccamc Unfortunately not, as it contains sensitive data. The texts are raw emails that haven't been cleaned up on purpose (signatures are included, as well as all replies / forwards).

beccamc changed the title from "Text classification : error when starting CPU training" to "Text classification : Target 4294967295 is out of bounds (CPU)" on Nov 30, 2022
Soarc commented Dec 25, 2022

@beccamc Same error. I can share mine :)
test-data.csv

beccamc (Contributor) commented Jan 3, 2023

@v-Hailishi Can you try to repro with Soarc's dataset?

@v-Hailishi

@beccamc By using Soarc's dataset test-data.csv, I can repro this issue on the latest main build 16.14.1.2262701

[screenshot of the error dialog]

beccamc (Contributor) commented Jan 4, 2023

@LittleLittleCloud Can you take a look at this?

beccamc added this to the February 2023 milestone on Jan 4, 2023
LittleLittleCloud (Contributor) commented Jan 5, 2023

4294967295 = 2^32 - 1, so this is probably caused by a wrong type cast.

After examining the code base, this place looks suspicious:

https://github.com/dotnet/machinelearning/blob/9d798f1bb3fb17fe97eba77a694c35e2cb46a4b7/src/Microsoft.ML.TorchSharp/NasBert/TextClassificationTrainer.cs#L110

When target is 0, target - 1 wraps around to 4294967295 after the cast from uint to long.
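
For illustration, a minimal standalone C# sketch of the suspected wraparound (the variable names are made up; this is not the actual trainer code):

    using System;

    uint target = 0;                // key value produced for a missing/unknown label
    uint shifted = target - 1;      // uint arithmetic wraps around: 0 - 1 == 4294967295
    long asLong = shifted;          // the wrapped value survives the widening cast to long
    Console.WriteLine(asLong);      // prints 4294967295, the value in the error message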

@michaelgsharp Would you take a closer look at this issue, and in particular verify whether TextClassification still works when one of the targets/labels is 0? Or should the target of text classification never be smaller than 1?

Update

The root cause is that MapValueToKey produces a key whose value is 0 when the value is NaN or does not exist in the term map, and TextClassification currently breaks on any dataset that contains a 0-valued key as a label.

I created an issue in the ML.NET repo; in the meantime, a temporary fix in Model Builder could be to filter out rows where the label is NaN/empty (see the sketch below).
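
For example, a rough ML.NET sketch of that kind of filtering, assuming a hypothetical EmailRow input class with Text and Label columns (illustrative only, not the actual Model Builder code path):

    using System.Linq;
    using Microsoft.ML;
    using Microsoft.ML.Data;

    var mlContext = new MLContext();
    IDataView data = mlContext.Data.LoadFromTextFile<EmailRow>(
        "test-data.csv", hasHeader: true, separatorChar: ',');

    // Drop rows with a missing/empty label so MapValueToKey never emits a 0-valued key.
    var cleaned = mlContext.Data
        .CreateEnumerable<EmailRow>(data, reuseRowObject: false)
        .Where(row => !string.IsNullOrWhiteSpace(row.Label))
        .ToList();

    IDataView trainData = mlContext.Data.LoadFromEnumerable(cleaned);
    // trainData can then be fed to the text-classification pipeline as usual.

    // Hypothetical input schema for a "text,label" CSV like the ones in this thread.
    public class EmailRow
    {
        [LoadColumn(0)] public string Text { get; set; }
        [LoadColumn(1)] public string Label { get; set; }
    }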

LittleLittleCloud added the "Priority:0 Work that we can't release without" label on Jan 5, 2023
@LittleLittleCloud (Contributor)

Scaling this issue up to Priority:0, as it might affect all text-classification scenarios.

beccamc (Contributor) commented Jan 5, 2023

Can we get a repro with just ML.NET?

@luisquintanilla (Contributor)

Closing this issue since it should be resolved in the framework. Tracking issue dotnet/machinelearning#6534

@scottyboiler

@luisquintanilla - I'm confused by the comment stating that this "should be resolved in the framework". I am encountering this issue today. Can you clarify the fix? Thank you.

@luisquintanilla (Contributor)

@scottyboiler Model Builder is tooling built on top of the ML.NET framework (the Microsoft.ML set of NuGet packages). The issue is at the framework level, not in the tooling, so fixing it there will also fix it for Model Builder. I hope that clarifies it.

@scottyboiler

@luisquintanilla clarifying the ML.NET framework vs. the .NET framework is a helpful distinction. Thanks for the quick response.
