dxc-technology / dxc-industrialized-ai-starter
Industrialized AI Starter
License: Apache License 2.0
Generalize the handling of the column-name conversion issue in the clean_dataframe, write_raw_data, and pipeline functions.
When the function writes source code to Algorithmia, that generated source code contains an issue.
Describe the bug
I am unable to run AI experiments for Time-Series problems
To Reproduce
When I passed timeseries as a parameter, I got the error below:
I tried to pick up an example from https://nbviewer.jupyter.org/github/dxc-technology/DXC-Industrialized-AI-Starter/blob/c58754247060262ac0949396e48f71861cb79d4e/Examples/Time_series_Model.ipynb
On setting the value "model": 'timeseries',
the time-series values are not displayed as expected; instead, the same value is shown for all predictions.
Please suggest a way to handle time-series problems.
Expected behavior
Please create a function for time-series problems and restore the functionality, as it appears to have been implemented previously.
Screenshots
Added the images
Additional context
Add any other context about the problem here.
Make changes so the AI Starter can integrate with MongoDB version 4.2.
Problem:
In the DXC_Industrialized_AI_Starter.ipynb notebook, there are a couple of instances where it suggests you can upload a local file to Colab. This doesn't work. The error message I get is:
MessageError: TypeError: Cannot read property '_uploadFiles' of undefined
MM/D/YYYY and MM/DD/YYYY are two popular US date formats, and it appears the ai.clean_dataframe() function cannot recognize them.
ParserError: Could not match input '10/1/2020' to any of the following formats: YYYY-MM-DD, YYYY-M-DD, YYYY-M-D, YYYY/MM/DD, YYYY/M/DD, YYYY/M/D, YYYY.MM.DD, YYYY.M.DD, YYYY.M.D, YYYYMMDD, YYYY-DDDD, YYYYDDDD, YYYY-MM, YYYY/MM, YYYY.MM, YYYY, W
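The parser error above comes from a fixed list of accepted ISO-like formats. A minimal sketch of one possible fix, assuming it is acceptable to parse US-style dates explicitly with pandas (the exact parser the library uses internally may differ):

```python
import pandas as pd

# Hypothetical fix: parse US-style dates with an explicit format
# instead of relying on a fixed list of ISO-like formats
parsed = pd.to_datetime("10/1/2020", format="%m/%d/%Y")
print(parsed)  # Timestamp for October 1, 2020
```

`%d` accepts both zero-padded and non-padded days, so the same format string covers MM/D/YYYY and MM/DD/YYYY.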
Users in the boot camp are facing an "as_matrix" error when calling ai.explore_features(raw_data) in the AI Workbook.
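The "as_matrix" error typically appears because `DataFrame.as_matrix()` was removed in newer pandas versions. A small sketch of the standard replacement (assuming the failure is the pandas removal and not something specific to explore_features):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
# df.as_matrix() was deprecated and later removed from pandas;
# to_numpy() is the supported replacement
arr = df.to_numpy()
```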
Reasons to implement
name | title | about | labels | assignees |
---|---|---|---|---|
Transparency Request | metrics or statistics of differences between raw data and clean data. |
Describe the area of code that needs more transparency:
Display metrics or statistics that show the difference between raw data and clean data.
Describe the solution you'd like:
Describe the alternatives you've considered:
Additional context or comments:
Show the stats of raw and clean data as different columns
We should have metrics for categorical and numerical data, and should also think about how to provide usable metrics for data sets with many features.
name | title | about | labels | assignees |
---|---|---|---|---|
Transparency Request | Rename the variable name in notebook |
Describe the area of code that needs more transparency:
Rename the variable representing the dataset after cleaning the data (e.g., Raw_data → Clean_data)
Describe the solution you'd like:
Describe the alternatives you've considered:
Additional context or comments:
Create an Issue template to capture the issues/changes for AI Ethics & Trustworthiness
In publish_microservice, we need to create a new algorithm if one does not already exist in Algorithmia. This was not working in our previous code, and changes were made to make it run.
A user is facing an 'int' object has no attribute 'split' error in the data pipeline function while working with numeric data operators.
In the documentation, can you show an example of publishing a microservice from a custom model instead of a model generated from the run_experiment() function? The custom model could be something as simple as a function that adds two numbers or appends text to an input.
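A sketch of what such a custom model could look like — a plain function, not anything produced by run_experiment(). The function name and its shape here are purely illustrative; whatever interface publish_microservice actually expects would need to come from the library's documentation:

```python
# Hypothetical minimal "custom model": a plain function that a
# microservice wrapper could expose instead of a trained model
def append_text(input_text):
    # Appends a fixed marker to whatever input the caller sends
    return str(input_text) + " (processed)"

print(append_text("hello"))  # "hello (processed)"
```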
Can you include a clustering model option in the experiment design?
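For context, a clustering option would likely sit on scikit-learn, which the starter already depends on for supervised experiments. A minimal sketch, assuming KMeans as the example algorithm (the library may choose a different one):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious clusters of 1-D points
X = np.array([[0.0], [0.1], [10.0], [10.1]])

# fit_predict returns a cluster label per row
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```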
name | title | about | labels | assignees |
---|---|---|---|---|
Transparency Request | Add logs to the crucial steps |
Describe the area of code that needs more transparency:
Add logs to the crucial steps so that the user is able to revert changes
Describe the solution you'd like:
Describe the alternatives you've considered:
Additional context or comments:
Add logs to the crucial steps (aggregation) that give the user the ability to revert changes if something goes wrong.
Log storage options; recommend config options (at least) for both MongoDB Atlas and local storage.
name | title | about | labels | assignees |
---|---|---|---|---|
Transparency Request | Add test cases links in contributing guide |
Describe the area of code that needs more transparency:
Add test-case links to the contributing guide, revisit the contributing guide document, and make the necessary changes.
Describe the solution you'd like:
Describe the alternatives you've considered:
Additional context or comments:
Refactor the initial version of the code into submodules.
name | title | about | labels | assignees |
---|---|---|---|---|
Transparency Request | Provide Auto-ML documentation link |
Describe the area of code that needs more transparency:
Provide Auto-ML documentation link in the user guide for running models
Describe the solution you'd like:
Describe the alternatives you've considered:
Additional context or comments:
If possible, we should provide an easy way to expose deep links to the specific algorithms, as part of supporting data scientists in making their work explainable.
Describe the bug
Trying to upload an Excel file (in Colab), but it's failing with the error below:
read_data_frame_from_local_excel_file()
29 uploaded = files.upload()
30 excel_file_name = list(uploaded.keys())[0]
---> 31 df = pd.read_excel(io.BytesIO(uploaded[excel_file_name]))
32 return(df)
33
NameError: name 'io' is not defined
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Dataframe with the Excel data
Screenshots
If applicable, add screenshots to help explain your problem.
Additional context
Add any other context about the problem here.
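The NameError above indicates that the helper uses `io.BytesIO` without ever importing `io`; adding `import io` at the top of the module should fix it. A reduced, hypothetical version of the helper with the import in place (the Colab upload step is omitted so the wrapping logic can be shown on its own):

```python
import io  # the missing import behind "NameError: name 'io' is not defined"
import pandas as pd

# Reduced sketch of read_data_frame_from_local_excel_file: given the raw
# bytes of an uploaded workbook, wrap them so pandas can read them
def read_excel_from_bytes(raw_bytes):
    return pd.read_excel(io.BytesIO(raw_bytes))
```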
After getting a "rework" status on a badge, making the required changes and resubmitting returns an error response that an assertion already exists:
<Response [400]>
{
  "errorMessage": {
    "statusCode": 400,
    "exception": "BadRequest",
    "message": "Assertion already exists",
    "payload": {
      "evidence": "https://colab.research.google.com/drive/1oMUEVLS5x1netqWaL0kAt8FVR-ABeJ4U",
      "lastUpdated": "2021-02-10T15:37:33Z",
      "badge": "Create a Data Story",
      "status": "rework",
      "created": "2021-02-09T16:06:25Z",
      "comments": [
        {
          "date": "2021-02-10T15:37:33Z",
          "comment": "Hello, nice work, but for the sample data set area, please upload your data set, not the iris file.",
          "email": "[email protected]"
        }
      ],
      "email": "[email protected]",
      "d1": "user:6366a530-391b-41be-bbff-6f372658afef",
      "d2": "badge:dd05bbdf-ad5b-469d-ab2c-4dd218fd68fe",
      "salt": "ecaa9028a8feb321be864bf98ac1ebe6",
      "reviewer": "[email protected]",
      "sk": "assertion",
      "pk": "assertion:93372602-81eb-47dc-bf05-5bc475c276b6"
    }
  }
}
I am just wondering how I can test the API keys obtained from MongoDB using this library. Is that included?
By "test" I mean checking that the API keys work and can successfully establish a connection.
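The library itself may not ship such a check. As a first step before involving credentials at all, a sketch of a plain TCP reachability test for the MongoDB host (the function name and signature here are hypothetical; actual credential validation would go through a driver such as pymongo):

```python
import socket

def can_reach(host, port, timeout=2.0):
    # Hypothetical pre-flight check: can we even open a TCP
    # connection to the MongoDB host before trying to authenticate?
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns True but authentication still fails, the problem is the keys rather than the network.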
By default ai.run_experiment() produces a lot of output. Can we add a parameter to the function that defines a verbose and succinct mode? In verbose mode, the function produces the full output, but in succinct mode, only a summary is output. I recommend making succinct mode the default.
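One possible way to implement such a switch without touching the underlying experiment code is to suppress stdout when verbose is off. A minimal sketch; `run_quietly` and its signature are illustrative, not part of the library:

```python
import contextlib
import io

def run_quietly(fn, *args, verbose=False, **kwargs):
    # Sketch of a verbose/succinct switch: when verbose is False,
    # swallow everything the wrapped call prints to stdout
    if verbose:
        return fn(*args, **kwargs)
    with contextlib.redirect_stdout(io.StringIO()):
        return fn(*args, **kwargs)

def noisy_experiment():
    print("lots of intermediate output")
    return 42
```

A fuller implementation would likely also summarize key results in succinct mode rather than discarding all output.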
Users are facing an issue with column names in the access_data_from_pipeline function in the scenarios below:
So a fix needs to be made to:
The module is deprecated in version 0.21 and removed in version 0.23, so a hotfix is needed to avoid problems during boot camp usage.
Describe the bug
pandas.io.json.json_normalize is deprecated; the recommendation is to use pandas.json_normalize instead.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
no error from Pandas lib
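The migration for the deprecation above is a one-line change, since the top-level function behaves the same way:

```python
import pandas as pd

data = [{"a": 1, "nested": {"b": 2}}]

# Old (deprecated): from pandas.io.json import json_normalize
# New: the top-level pandas function with identical behavior
df = pd.json_normalize(data)
```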
Describe the bug
pip install DXC_Industrialized_AI_Starter-2.3.9-py3-none-any.whl takes several minutes and times out in Google Colab because multiple versions of libraries are downloaded.
To Reproduce
Steps to reproduce the behavior:
run pip install
Expected behavior
The starter should install in at most 10 minutes.
Screenshots
If applicable, add screenshots to help explain your problem.
Additional context
Add any other context about the problem here.
Users are facing an issue with the ai.clean_dataframe function when
Describe the bug
distplot is a deprecated function and will be removed in a future version; we have to find an alternative to replace it.
To Reproduce
Steps to reproduce the behavior:
distplot is a deprecated function and will be removed in a future version. Please adapt your code to use either displot (a figure-level function with similar flexibility) or histplot (an axes-level function for histograms). Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be data, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
Expected behavior
We have to find an alternative for this function.
Additional context
N/A
name | title | about | labels | assignees |
---|---|---|---|---|
Transparency Request | Add encryption to the published microservice | Research |
Describe the area of code that needs more transparency:
Add encryption to the published microservice
Describe the solution you'd like:
Describe the alternatives you've considered:
Additional context or comments:
Research levels of best-practice security for different types of data. We could offer a parameter mapped to predefined configurations for low, medium, high, and/or extra-high levels of security.
ai.read_data_frame_from_local_csv() needs to be able to accept files with other types of separators. I tried to import a pipe-delimited CSV file and could not set sep='|' to read the file.
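A sketch of how the helper could pass a separator through to pandas. The signature here is hypothetical — the actual function reads from a Colab upload rather than a file object — but the core change is just forwarding `sep` to `pd.read_csv`:

```python
import io
import pandas as pd

# Hypothetical signature change: expose pandas' sep parameter
def read_data_frame_from_local_csv(file_like, sep=","):
    return pd.read_csv(file_like, sep=sep)

# Usage with a pipe-delimited file (StringIO stands in for the upload)
pipe_file = io.StringIO("a|b\n1|2\n")
df = read_data_frame_from_local_csv(pipe_file, sep="|")
```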
name | title | about | labels | assignees |
---|---|---|---|---|
Transparency Request | Include metadata details |
Describe the area of code that needs more transparency:
Include metadata details while writing data into MongoDB
Describe the solution you'd like:
Describe the alternatives you've considered:
Additional context or comments:
Include this in the code as a minimum, but ideally store it in Mongo:
Where did we get the data from? Did we run the cleaner on the data?
What version of the data is it?
Where did we store the data?
(This is a manual entry.)
Can you include a time-series model option in the experiment design?
Add a function to read a JSON file from the local system using the library, and handle cases where the JSON file contains nested dictionaries.
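The nested-dictionary case usually means flattening inner keys into dotted column names. A stdlib-only sketch of that flattening step (names here are illustrative; pandas' json_normalize does the same job for tabular output):

```python
import json

def flatten(d, parent_key="", sep="."):
    # Recursively flatten nested dicts into dotted keys,
    # e.g. {"a": {"b": 1}} -> {"a.b": 1}
    items = {}
    for k, v in d.items():
        key = f"{parent_key}{sep}{k}" if parent_key else k
        if isinstance(v, dict):
            items.update(flatten(v, key, sep=sep))
        else:
            items[key] = v
    return items

# Reading from disk would then be: flatten(json.load(open(path)))
```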
name | title | about | labels | assignees |
---|---|---|---|---|
Transparency Request | Add the drift function after pipeline |
Describe the area of code that needs more transparency:
Add the drift function after pipeline
Describe the solution you'd like:
Describe the alternatives you've considered:
Additional context or comments:
Calculate the drift between two given data sets over a period of time, or between training sets.
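As a toy illustration of what such a drift function might compute, a stdlib-only sketch of one very simple score (shift of the mean in units of the reference standard deviation); real drift detection would likely use a proper statistical test per feature:

```python
from statistics import mean, stdev

def mean_drift(reference, current):
    # Hypothetical drift score: how far the current mean has moved,
    # measured in reference standard deviations
    return abs(mean(current) - mean(reference)) / stdev(reference)
```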
Resize the pictures
Convert all images into the same file format
Merge the images into a single file
Convert the images into a CSV file
Make a few changes to the CSV file
Load the CSV file
Most of those attending the DXC-Industrialized AI course have not used Colab before.
Some of the instructions for getting started in the "Set up the development environment" could be clearer.
e.g. "This code installs all the packages you'll need. Run it first."
Most people who have not used Colab before would not know how to do this.
Can you include a deep learning model option in the experiment design?
Explore LIME, SHAP Libraries and implement Interpretable Machine Learning Models in AI Library.
The link in the example that is supposed to list available data sets is broken
name | title | about | labels | assignees |
---|---|---|---|---|
Transparency Request | Indicate the completeness or correctness of the data and show the outliers |
Describe the area of code that needs more transparency:
Indicate the completeness or correctness of the data and show the outliers
Describe the solution you'd like:
Describe the alternatives you've considered:
Additional context or comments:
Logan comments:
Not just visualization: completeness, correctness, outliers, and other metrics should be saved as statistics, too. We can't force others to make their work explainable to an end user, but we can ensure that their work is capable of being explained if they use our package.
Describe the bug
The code fails when I run the part responsible for importing ai from DXC (from dxc import ai)
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The code should import the ai from DXC library without any issue
Additional context
None
Publish DXC-Industrialized-AI-Starter google colab notebook file to GitHub.
By default ai.publish_microservice() produces a lot of output. Can we add a parameter to the function that defines a verbose and succinct mode? In verbose mode, the function produces the full output, but in succinct mode, only a summary is output. I recommend making succinct mode the default.