🌍 Live Open Source Explorer

Explore live open-source projects and AI models.

Search public open-source repositories from GitHub and AI models from Hugging Face. Every page shows 10 results with clean pagination.

🔎 Live Search

Search live open-source data

Search GitHub repositories and Hugging Face models directly, then explore stars, downloads, source links and project details.

Live Results

GitHub Open Source Repositories

Search: linguistic-data

Page 1

Showing 10 of 30 results

proycon/pynlpl

GitHub · Python · GNU General Public License v3.0

PyNLPl, pronounced 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for basic tasks such as the extraction of n-grams and frequency lists, and to build a simple language model. There ar...

★ 476 · Forks 66 · proycon · Updated 11 Apr 2026

ChangdeDu/BraVL

GitHub · Python · MIT License

Code and Data for "Decoding Visual Neural Representations by Multimodal Learning of Brain-Visual-Linguistic Features"

★ 150 · Forks 27 · ChangdeDu · Updated 15 Jun 2026

EticaAI/linguistic-datasets-portuguese

GitHub · The Unlicense

Linguistic Datasets for Portuguese: a list of linguistic datasets for the Portuguese language with flexible licenses: databases, word lists, synonyms, antonyms, thematic dictionaries, thesauri, linked data, semantics, ontologies and knowledge representation

★ 83 · Forks 5 · EticaAI · Updated 02 Jun 2026

Maximax67/Words-CEFR-Dataset

GitHub · Jupyter Notebook · MIT License

A dataset mapping English words to CEFR levels based on the CEFR-J dataset, word lemmas, stems, parts of speech (POS), and frequency data from the N-Gram Google dataset. Ideal for NLP tasks, language proficiency assessment, and linguistic research.

★ 77 · Forks 16 · Maximax67 · Updated 16 Jun 2026

clld/clld

GitHub · Python · Other

A web framework to display Cross-Linguistic Linked Data.

★ 73 · Forks 26 · clld · Updated 13 Jun 2026

proycon/folia

GitHub · Python · GNU General Public License v3.0

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchan...

★ 66 · Forks 10 · proycon · Updated 22 May 2026

cldf/cldf

GitHub · Python · Apache License 2.0

CLDF: Cross-Linguistic Data Formats - the specification

★ 64 · Forks 17 · cldf · Updated 09 May 2026

microsoft/CodeMixed-Text-Generator

GitHub · Jupyter Notebook · MIT License

This tool helps automatically generate grammatically valid synthetic code-mixed data by utilizing linguistic theories such as Equivalence Constraint Theory and Matrix Language Theory.

★ 61 · Forks 11 · microsoft · Updated 11 Jun 2026

vered1986/UnsupervisedHypernymy

GitHub · Python · Other

Data and code for the experiments in: "Hypernyms under Siege: Linguistically-motivated Artillery for Hypernymy Detection". Vered Shwartz, Enrico Santus and Dominik Schlechtweg. EACL 2017.

★ 51 · Forks 13 · vered1986 · Updated 25 Mar 2026

dowobeha/ldc_downloader

GitHub · Shell · GNU General Public License v3.0

Script to download corpora from the Linguistic Data Consortium (LDC)

★ 34 · Forks 10 · dowobeha · Updated 14 Jan 2026

Page 1 of 3

Showing first 30 accessible GitHub results.
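
The paging figures on this page (10 results per page, 30 accessible results, 3 pages) follow from simple arithmetic. Below is a minimal Python sketch of that arithmetic, assuming a fixed page size of 10. The helper names are hypothetical, and the use of GitHub's public repository search endpoint is an assumption about how a listing like this could be fed, not something this page states.

```python
import math
from urllib.parse import urlencode

PER_PAGE = 10  # page size stated on this page


def page_count(total, per_page=PER_PAGE):
    """Number of pages needed to show `total` results."""
    return math.ceil(total / per_page)


def page_slice(items, page, per_page=PER_PAGE):
    """Return the items shown on a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    start = (page - 1) * per_page
    return items[start:start + per_page]


def github_search_url(query, page, per_page=PER_PAGE):
    """Build a GitHub repository search URL for one page of results.

    Assumption: the listing is backed by GitHub's REST search endpoint.
    This function only builds the URL and makes no network request.
    """
    params = {"q": query, "per_page": per_page, "page": page}
    return "https://api.github.com/search/repositories?" + urlencode(params)


if __name__ == "__main__":
    print(page_count(30))
    print(github_search_url("linguistic-data", 1))
```

With 30 results and a page size of 10, `page_count` gives 3, which matches "Page 1 of 3" above.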